Abstract. The selection of the most suitable evolutionary model to analyze
the given molecular data is usually left to the biologist's choice. In his famous book,
J. Felsenstein suggested that certain linear equations satisfied by the expected
probabilities of patterns observed at the leaves of a phylogenetic tree could
be used for model selection. The question of whether these equations suffice
to characterize the evolutionary model remained open.

Here we prove that, for equivariant models of evolution, the space of distributions
satisfying these linear equations coincides with the space of distributions
arising from mixtures of trees on a set of taxa. In other words, we prove that
an alignment is produced from a mixture of phylogenetic trees under an equivariant
evolutionary model if and only if its distribution of column patterns
satisfies the linear equations mentioned above. Moreover, for each equivariant
model and for any number of taxa, we provide a set of linearly independent
equations defining this space of phylogenetic mixtures. This is a powerful tool
that has already been successfully used in model selection. We also use the
results obtained to study identifiability issues for phylogenetic mixtures.



1. Introduction 

In phylogenetics, the goal is to reconstruct the ancestral relationships among 
organisms. Most of the widely used phylogenetic reconstruction methods are 
based on mathematical models describing the molecular evolution of DNA. The 
problem of choosing the most suitable model for the given data is usually left to
the biologist's choice, and there are no fully reliable methods to choose the best model
(cf. [PosOl]).

In this paper we address the following question: in accordance with Darwin's
theory that evolution occurs following a tree, how can the data coming from a
particular evolutionary model be characterized? In other words, are there any
invariants of the DNA patterns that have evolved following a tree (or a mixture
of trees, as we will see below) under a particular model? The answer to these
questions leads to a complete characterization of the evolutionary model and
therefore can be used as a selection criterion for the most suitable evolutionary
model for the given data.

Here we explain our motivation to solve this problem. It is well known that
the expected probabilities of nucleotides observed at the leaves of a phylogenetic
tree satisfy a collection of equalities if the tree evolves under certain models (see
for instance [Fel03, p.375]). It was already pointed out in [FL92], [SHSE92] and
[Fel03] that these equalities (referred to as linear invariants) could potentially
be used to test the model of base change; but how could it be guaranteed that
there were no more equalities to be used?

We answer the questions above for equivariant models of molecular evolution
([DK09], [CFS11]). These are Markov processes on trees whose transition
matrices satisfy certain symmetries, and they include the Jukes-Cantor model, the
Kimura 2- and 3-parameter models, the strand symmetric model and the general
Markov model. Our main result in section 4 states that, if evolution occurs according
to trees (or even mixtures of trees), then the model of evolution is determined by a
linear space. By exhaustively studying the group of symmetries of these models, we
also give an easy and combinatorial way of determining the equations of this linear
space. In the paper [KDGC11] these equations are successfully used for model
selection in phylogenetics.

Our main technique consists in proving that the linear space above coincides
with the space of phylogenetic mixtures evolving under the model $\mathcal{M}$; that is,
the set of points that are a linear combination of points lying in the phylogenetic
varieties $CV_T^{\mathcal{M}}$ (see section 2 for an explanation of this terminology). In
biological words, this is the set of alignments whose columns have evolved following the
model $\mathcal{M}$ under a phylogenetic tree (not necessarily the same tree in the whole
alignment, and not necessarily the same transition matrices). In phylogenetics,
the hypothesis that the sites of an alignment are independent and identically
distributed is often used in the simplest models. When one removes the assumption
"identically distributed" and replaces it by "distributed according to
the same evolutionary model" then one obtains a phylogenetic mixture. This
phenomenon is needed to explain heterogeneous evolutionary processes when data
comprises multiple genes or selected codon positions. Among a plethora of
applications, phylogenetic mixtures are used in orthology prediction, gene and
genome annotation, species tree reconstruction and drug target identification. In
the usual setting, phylogenetic mixtures are modeled by assuming rate variation
across sites (see [SS03] for more information on these concepts).



THE SPACE OF PHYLOGENETIC MIXTURES FOR EQUIVARIANT MODELS 






As a byproduct we are able to determine the dimension of these linear
spaces and use it to give an upper bound on the number of mixtures that can be
used in a phylogenetic reconstruction problem. This is part of the identifiability
issue in phylogenetic mixtures: determine conditions that guarantee that the
model parameters (trees and continuous parameters) can be recovered from the
data. This problem is crucial for justifying methods such as maximum likelihood
and has been extensively studied recently, but only a few results are known (see
for instance [AR06a], [APRS10], [SV07], [RS], [CH11]). We explain this topic in
detail in section 5.

The main tools used in this work are algebraic geometry and group theory. 
We refer to [Har92] and to [Ser77] for general references on these topics. 

2. Preliminaries 

Let $n$ be a positive integer and denote by $[n]$ the set $\{1, 2, \dots, n\}$. For
biological purposes, we think of $[n]$ as a set of DNA sequences associated to certain
taxa and we consider trees as connected acyclic graphs whose $n$ leaves are bijectively
labelled by the set $[n]$. Let $\mathcal{T}_n$ be the set of tree topologies (up to
isomorphism) whose leaves are labelled by $[n]$. Trees in $\mathcal{T}_n$ are allowed to
have internal vertices of any degree. When all the internal vertices of a tree
$T \in \mathcal{T}_n$ have degree 3, we say that the tree is trivalent.

Here we introduce the definitions needed in the sequel. 

We fix an ordered set $B = \{b_1, b_2, \dots, b_k\}$ and we think of it as a basis
of a vector space $W := \langle B \rangle_{\mathbb{C}}$. For example, for most applications
we use $B = \{A, C, G, T\}$ and think of its elements as nucleotides in a DNA sequence.

Definition 2.1. A phylogenetic tree on $W$ is a tree $T$ that has the vector space
$W_v := W$ associated to each vertex $v$ of $T$. Usually the same notation $T$ is used
to represent both the graph and the phylogenetic tree. Elements of $B$ at the
vertices of $T$ are thought of as states of discrete random variables at the vertices.

Definition 2.2. Let $T$ be a phylogenetic tree on $W$ and assume that a distinguished
vertex $r$ of $T$ (usually called the root) is given, inducing therefore an
orientation in all edges of $T$. If $e$ is an edge of $T$, we write $e_0$ and $e_1$ for the
origin and final vertices of the edge $e$, respectively. An evolutionary presentation
of a phylogenetic tree $T$ is a vector $\pi = (\pi_{b_1}, \pi_{b_2}, \dots, \pi_{b_k}) \in W_r$,
together with a collection of maps $A = (A^{(e_0,e_1)})_{e \in E(T)}$ where each
$A^{(e_0,e_1)}$ belongs to $\mathrm{Hom}(W_{e_0}, W_{e_1})$.

From now on, we will identify vectors in $W$ with their coordinates in the basis
$B$ written as a column vector. Similarly, we will identify the set $\mathrm{Hom}(W, W)$ with
the set of matrices with $k$ rows and $k$ columns and entries in the complex field
by mapping any linear map to its matrix in the basis $B$. We take the convention
that the matrices $A = A^{(e_0,e_1)}$ in an evolutionary presentation act on $W$ from the
right (i.e. the action is $\omega^t \in W_{e_0} \mapsto \omega^t A \in W_{e_1}$).

From now on, the vector $(1, 1, \dots, 1) \in W$ will be denoted by $\mathbf{1}$.

Definition 2.3. An algebraic evolutionary model $\mathcal{M}$ is specified by giving a vector
subspace $W_0 \subset W$ such that $\mathbf{1}^t \pi \neq 0$ for every $\pi \neq 0$ in $W_0$,
together with a multiplicatively closed (groupoid) subspace $\mathrm{Mod}$ of
$\mathrm{Hom}(W, W)$. We will usually denote it by $\mathcal{M} = (W_0, \mathrm{Mod})$.

If $T$ is a rooted phylogenetic tree on $W$, then $T$ evolves under the algebraic
evolutionary model $\mathcal{M}$ if the matrices of its evolutionary presentations lie in
$\mathrm{Mod}$ and the vector $\pi$ at the root belongs to $W_0$. We write
$\mathrm{Par}_{\mathcal{M}}(T) = W_0 \times \prod_{e \in E(T)} \mathrm{Mod}$ for the
corresponding space of parameters.

Remark 2.4. The condition $\mathbf{1}^t \pi \neq 0$ for every nonzero $\pi \in W_0$ in the
definition above means that the sum of the coordinates of the vectors in $W_0$ is
different from zero, unless the vector is already 0. Since the vectors in $W_0$ represent
the possible distributions for the root in the tree $T$, this condition is biologically
meaningful and implies no significant restriction for our evolutionary models.

Below we give some well known examples of evolutionary models. 

Definition 2.5. Let $G$ be a permutation group of $B$ (that is, a group whose
elements are permutations of the set $B$, $G \le \mathfrak{S}_k$). Given $g \in G$, write
$P_g$ for the $k \times k$ permutation matrix corresponding to $g$: $(P_g)_{i,j} = 1$ if
$g(j) = i$ and $0$ otherwise. The $G$-equivariant evolutionary model is defined by taking
$\mathrm{Mod}$ equal to $\mathrm{Hom}_G(W, W)$, that is,

$$\mathrm{Hom}_G(W, W) = \{A \in M_{k,k}(\mathbb{C}) \mid A P_g = P_g A \ \ \forall g \in G\}$$

and $W_0 = \{\pi \in W \mid P_g \pi = \pi \ \ \forall g \in G\}$. It is clear that the above
subsets define vector subspaces of $\mathrm{Hom}(W, W)$ and $W$, respectively. On the other
hand, if $A_1, A_2 \in \mathrm{Hom}_G(W, W)$, then

$$P_g A_1 A_2 P_g^{-1} = (P_g A_1 P_g^{-1})(P_g A_2 P_g^{-1}) = A_1 A_2$$

so $A_1 A_2 \in \mathrm{Hom}_G(W, W)$. Therefore, equivariant models provide a wide family
of examples of algebraic evolutionary models in the sense of Definition 2.3. For
example, if $B = \{A, C, G, T\}$ and

• $G = \mathfrak{S}_4$, this is the algebraic Jukes-Cantor model JC69,

• $G = \langle (ACGT), (AG) \rangle$, this is the algebraic Kimura 2-parameter model K80,

• $G = \langle (AC)(GT), (AG)(CT) \rangle$, this is the algebraic Kimura 3-parameter model
K81,

• $G = \langle (AT)(CG) \rangle$, it is known as the strand symmetric model SSM, and

• $G = \langle \mathrm{id} \rangle$, this is the general Markov model GMM.

For an equivariant model $\mathcal{M}$ we denote by $G_{\mathcal{M}}$ the corresponding group.
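The commutation condition $A P_g = P_g A$ in Definition 2.5 amounts to $A_{g(i),g(j)} = A_{i,j}$ for all $g \in G$, so the dimension of $\mathrm{Hom}_G(W, W)$ is the number of $G$-orbits on $B \times B$. The following Python sketch illustrates this count for the five groups listed above. The helper names (`perm`, `generate_group`, `commutant_dimension`) and the choice of $(ACGT)$, $(AC)$ as generators of $\mathfrak{S}_4$ are assumptions made for this illustration only; they are not part of the original text.

```python
from itertools import product

BASES = "ACGT"

def perm(*cycles):
    """Permutation of BASES as a tuple of images, built from cycles like "ACGT"."""
    image = {b: b for b in BASES}
    for cycle in cycles:
        for i, b in enumerate(cycle):
            image[b] = cycle[(i + 1) % len(cycle)]
    return tuple(image[b] for b in BASES)

def generate_group(generators):
    """Close the generators under composition (sufficient for a finite group)."""
    identity = tuple(BASES)
    group, frontier = {identity}, [identity]
    while frontier:
        g = frontier.pop()
        for s in generators:
            h = tuple(s[BASES.index(x)] for x in g)  # apply g first, then s
            if h not in group:
                group.add(h)
                frontier.append(h)
    return group

def commutant_dimension(group):
    """Count G-orbits on B x B, i.e. dim {A : A P_g = P_g A for all g in G}."""
    seen, orbits = set(), 0
    for x, y in product(BASES, repeat=2):
        if (x, y) in seen:
            continue
        orbits += 1
        for g in group:
            seen.add((g[BASES.index(x)], g[BASES.index(y)]))
    return orbits

# Generators as in Definition 2.5; the S_4 generating set is our own choice.
MODELS = {
    "JC69": [perm("ACGT"), perm("AC")],
    "K80":  [perm("ACGT"), perm("AG")],
    "K81":  [perm("AC", "GT"), perm("AG", "CT")],
    "SSM":  [perm("AT", "CG")],
    "GMM":  [],
}

ORDERS = {name: len(generate_group(gens)) for name, gens in MODELS.items()}
DIMS = {name: commutant_dimension(generate_group(gens)) for name, gens in MODELS.items()}

for name in MODELS:
    print(name, "group order:", ORDERS[name], "dim Hom_G(W, W):", DIMS[name])
```

Under these assumptions the group orders come out as 24, 8, 4, 2, 1 and the dimensions of $\mathrm{Hom}_G(W, W)$ as 2, 3, 4, 8, 16 for JC69, K80, K81, SSM and GMM respectively, matching the usual count of free entries in these matrices before the row-sum constraint is imposed.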

Notation 2.6. If $\omega = (\omega_1, \dots, \omega_k)$ is a vector in $W$, then $D_\omega$
denotes the square matrix with the entries of $\omega$ on the diagonal and zeros elsewhere.

Example 2.7. ([AR06b]) Let $\pi \in W$ be such that $\pi^t \mathbf{1} \neq 0$. The $\pi$-stable
base distribution model ($\pi$-SBD) is given by taking $W_0 = \langle \pi \rangle_{\mathbb{C}}$
and letting $\mathrm{Mod}$ be the set of matrices $A$ for which $\pi$ is a left eigenvector.
Notice that the $\pi$-SBD satisfies the conditions of an algebraic evolutionary model of
Definition 2.3: $\mathrm{Mod}$ is a vector space and it is multiplicatively closed. By adding
the conditions that $\pi$ has eigenvalue 1 for every $A \in \mathrm{Mod}$ and the entries of
each row of each matrix $A$ sum to one, we obtain the stable base distribution model
defined in [AR06b] (see Definition 2.14).

For instance, JC69, K80 and K81 are $\pi$-SBD submodels with
$\pi = (1, 1, 1, 1)$. The model SSM is not an SBD model since matrices in $\mathrm{Mod}$ do not
share a fixed left eigenvector.

Definition 2.8. Given a phylogenetic tree $T$ on $W$, $T \in \mathcal{T}_n$, an $[n]$-tensor
is any element of $\mathcal{L} := \otimes_{[n]} W$.

Notation 2.9. We will denote by $\mathcal{B} = B^n$ the set of $n$-words in $B$,

$$\mathcal{B} = \{X = (x_1, \dots, x_n) : x_i \in B\}.$$

For the sake of simplicity in our notation, sometimes it will be convenient to
identify every word $X = (x_1, \dots, x_n)$ with the tensor
$x_1 \otimes \dots \otimes x_n \in \mathcal{L}$ and consequently, we will identify
$\mathcal{B}$ with the natural basis of $\mathcal{L}$. Notice that a distribution
$p = (p_{b_1 \dots b_1}, \dots, p_{b_k \dots b_k})$ on the set of patterns in $B$ at the
leaves of a tree can be viewed as the tensor in $\mathcal{L}$ having these coordinates in
the basis $\mathcal{B}$, that is

$$p = \sum_{x_1, \dots, x_n \in B} p_{x_1 \dots x_n} \, x_1 \otimes \dots \otimes x_n .$$


Definition 2.10. Given an algebraic evolutionary model $\mathcal{M}$, the parametrization
of a rooted phylogenetic tree $T$ on $W$ evolving under the model $\mathcal{M}$ is the map

$$\Psi_T^{\mathcal{M}} : \mathrm{Par}_{\mathcal{M}}(T) \longrightarrow \mathcal{L}$$

that corresponds to a hidden Markov process on the tree $T$ when we restrict
to stochastic matrices and distributions in $W_0$ (leaves correspond to observed
random variables and the interior nodes to hidden variables). That is, if the tree
is rooted and directed from the root $r$, then the parametrization of $T$ is the map

$$\Psi_T^{\mathcal{M}}(\pi, A) = \sum_{x_i \in B} p_{x_1 \dots x_n} \, x_1 \otimes \dots \otimes x_n$$

where

$$(1) \qquad p_{x_1 \dots x_n} = \sum_{x_v \in B,\ v \in \mathrm{Int}(T)} \pi_{x_r} \prod_{v \in N(T) \setminus \{r\}} A^{(pa(v),v)}_{x_{pa(v)}, x_v},$$

$x_v$ denotes the state at the vertex $v$, $pa(v)$ is the parent node of $v$,
$\pi = (\pi_x)_{x \in B}$ are the coordinates in the basis $B$ of a vector associated to the
root, and if $v = i$ is a leaf node, then $x_v = x_i$.
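Formula (1) can be evaluated by brute force on small trees: for every leaf pattern, sum over all assignments of states to the interior nodes. The sketch below does this in plain Python. The function names, the encoding of the tree as a child-to-parent dictionary, and the numerical Jukes-Cantor-type matrices are assumptions chosen for illustration; they do not come from the paper.

```python
from itertools import product

STATES = "ACGT"

def jc_matrix(a):
    """Jukes-Cantor-type matrix: off-diagonal entries a, diagonal 1 - 3a."""
    return {(x, y): (1 - 3 * a if x == y else a) for x in STATES for y in STATES}

def pattern_distribution(parent, leaves, root, pi, matrices):
    """Evaluate formula (1) for every leaf pattern.

    parent   : dict mapping each non-root vertex v to pa(v)
    leaves   : ordered list of leaf vertices
    pi       : dict state -> coordinate of the root vector
    matrices : dict v -> matrix on the edge (pa(v), v), keyed by (x_pa(v), x_v)
    """
    internal = [v for v in set(parent.values()) | {root} if v not in leaves]
    p = {}
    for pattern in product(STATES, repeat=len(leaves)):
        total = 0.0
        for hidden in product(STATES, repeat=len(internal)):
            state = dict(zip(leaves, pattern))
            state.update(zip(internal, hidden))
            term = pi[state[root]]
            for v, pa in parent.items():
                term *= matrices[v][(state[pa], state[v])]
            total += term
        p["".join(pattern)] = total
    return p

# Hypothetical example: star tree with hidden root "r" and leaves "1", "2", "3".
parent = {"1": "r", "2": "r", "3": "r"}
pi = {s: 0.25 for s in STATES}
matrices = {"1": jc_matrix(0.10), "2": jc_matrix(0.05), "3": jc_matrix(0.20)}
p = pattern_distribution(parent, ["1", "2", "3"], "r", pi, matrices)

print(len(p), "patterns, total mass", sum(p.values()))
print("p_AAA =", p["AAA"], " p_GGG =", p["GGG"])
```

Because the matrices here are stochastic and $\pi$ sums to one, the 64 coordinates add up to 1 (up to rounding), and the Jukes-Cantor symmetry makes $p_{AAA}$ and $p_{GGG}$ coincide. The cost grows as $4^{n + |\mathrm{Int}(T)|}$, so this direct evaluation is only meant for small examples.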

Note that the position of the root plays a role in the above parameterization.
However, the following lemma shows that under some assumptions, its image is
independent of it. Let $T_u$ be a tree rooted at a vertex $u$, and consider a vertex
$v$ adjacent to $u$. Then, rooting the tree at $v$ induces the opposite orientation on the
edge $e$ delimited by $u$ and $v$. Write $T_v$ for the new oriented rooted tree. Then we
have the following lemma whose proof is left to the reader.

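Lemma 2.11. Let $T_u$ be evolving under an algebraic evolutionary model
$\mathcal{M} = (W_0, \mathrm{Mod})$. Let $(\pi, A)$ be an evolutionary presentation on $T_u$
such that $\pi$ has all its entries different from 0 and let $\tilde{\pi}^t = \pi^t A^e$.
Assume also that all the entries of $\tilde{\pi} \in W_0$ are different from 0 and that
$D_{\tilde{\pi}}^{-1} (A^e)^t D_\pi$ belongs to $\mathrm{Mod}$. Then, letting

$$\tilde{A}^{e'} = \begin{cases} D_{\tilde{\pi}}^{-1} (A^e)^t D_\pi, & \text{if } e' = e, \\ A^{e'}, & \text{otherwise} \end{cases}$$

and $\tilde{A} = (\tilde{A}^{e'})_{e' \in E(T_v)}$, we have
$\Psi_{T_u}^{\mathcal{M}}(\pi, A) = \Psi_{T_v}^{\mathcal{M}}(\tilde{\pi}, \tilde{A})$.

Therefore, for the models satisfying the conditions of the previous lemma
the image of the map $\Psi_T^{\mathcal{M}}$ does not depend on the position of the root.
We call these models root-independent models: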

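Definition 2.12. We say that an algebraic evolutionary model
$\mathcal{M} = (W_0, \mathrm{Mod})$ is root-independent if it satisfies

(i) $\tilde{\pi}^t := \pi^t A$ belongs to $W_0$ for all $\pi \in W_0$ and all
$A \in \mathrm{Mod}$, and

(ii) $D_{\tilde{\pi}}^{-1} A^t D_\pi \in \mathrm{Mod}$ whenever $D_{\tilde{\pi}}^{-1}$ does exist.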

Example 2.13. Equivariant models as well as the stable base distribution models 
are root-independent (we let the reader check it for equivariant models and we 
refer to [AR06b] for the SBD model). 

Definition 2.14. A stochastic evolutionary model $s\mathcal{M}$ is specified by a subset
$sW_0$ of vectors in $W$ whose entries sum to one, together with a multiplicatively
closed set $s\mathrm{Mod}$ of complex matrices whose rows sum to one.




Notice that the rows of a matrix sum to one if and only if $\mathbf{1}$ is a right
eigenvector corresponding to eigenvalue 1. Hence, for a stochastic evolutionary
model $s\mathcal{M}$, the space of matrices $s\mathrm{Mod}$ is not a vector subspace anymore.

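Example 2.15. If $\mathcal{M} = (W_0, \mathrm{Mod})$ is an algebraic evolutionary model, define
$s\mathcal{M} = (sW_0, s\mathrm{Mod})$ as $sW_0 = \{\pi \in W_0 : \mathbf{1}^t \pi = 1\}$ and
$s\mathrm{Mod} = \{A \in \mathrm{Mod} : A\mathbf{1} = \mathbf{1}\}$. Then, $s\mathcal{M}$ is a
stochastic evolutionary model.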

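Definition 2.16. The parameterization of a rooted tree $T$ evolving under a
model $\mathcal{M}$ restricts to a polynomial map $\phi_T^{\mathcal{M}}$ from

$$\mathrm{Par}_{s\mathcal{M}}(T) = sW_0 \times \prod_{e \in E(T)} s\mathrm{Mod}$$

to the hyperplane $H \subset \mathcal{L}$ defined by

$$H = \Big\{ p \in \mathcal{L} : \sum_{x_1, \dots, x_n \in B} p_{x_1 \dots x_n} = 1 \Big\}.$$

From now on, we will refer to this map as the stochastic parametrization.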

The map $\Psi_T^{\mathcal{M}}$ restricted to distributions in $W_0$ and stochastic matrices in
$\mathrm{Mod}$ assigns to each set of parameters the corresponding distribution of patterns
in $B$ at the leaves of the tree and therefore its image lies on the standard simplex
in $\mathcal{L} = \otimes_{[n]} W$. This justifies why the image of the stochastic
parametrization $\phi_T^{\mathcal{M}}$ lies in $H$.

We proceed to define algebraic varieties associated to the parameterization 
maps. For background in algebraic geometry see [Har92] . 

Definition 2.17. The affine phylogenetic variety $CV_T^{\mathcal{M}}$ associated to a
phylogenetic tree $T$ on $W$ is

$$CV_T^{\mathcal{M}} := \overline{\{\Psi_T^{\mathcal{M}}(\pi, A) : (\pi, A) \in \mathrm{Par}_{\mathcal{M}}(T)\}}$$

where the closure is taken in the Zariski topology. Equivalently, $CV_T^{\mathcal{M}}$ is the
smallest algebraic set containing the image of $\Psi_T^{\mathcal{M}}$.

The affine stochastic phylogenetic variety $V_T^{\mathcal{M}}$ associated to a phylogenetic
tree $T$ on $W$ is

$$V_T^{\mathcal{M}} := \overline{\{\phi_T^{\mathcal{M}}(\pi, A) : (\pi, A) \in \mathrm{Par}_{s\mathcal{M}}(T)\}} \subset H$$

where the closure is taken in the Zariski topology.

There is a natural isomorphism between the points lying in the hyperplane
$H = \{p = (p_{b_1 \dots b_1}, \dots, p_{b_k \dots b_k}) \in \mathcal{L} : \sum p_{x_1 \dots x_n} = 1\}$
and the open affine subset
$\{p = [p_{b_1 \dots b_1} : \dots : p_{b_k \dots b_k}] : \sum p_{x_1 \dots x_n} \neq 0\}$ of
$\mathbb{P}^{k^n - 1} = \mathbb{P}(\mathcal{L})$ (we use projective coordinates
$[p_{b_1 \dots b_1} : \dots : p_{b_k \dots b_k}]$ to distinguish them from affine coordinates).
The projective phylogenetic variety $\mathbb{P}V_T^{\mathcal{M}}$ associated to a phylogenetic
tree $T$ on $W$ is the closure in $\mathbb{P}^{k^n - 1} = \mathbb{P}(\mathcal{L})$ of the image
of the stochastic parameterization defined above.

The aim now is to study the relation between the above varieties. As it
is usually easier to deal with a homogeneous parameterization and homogeneous
polynomials, it will be useful to prove that $CV_T^{\mathcal{M}}$ is the cone over
$\mathbb{P}V_T^{\mathcal{M}}$. This is known for some particular models (for instance, see
[AR08] for a proof on the general Markov model) but as our definition of algebraic
evolutionary model is quite general, we would like to prove it in its maximum generality.

We will use the following notation: 

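Notation 2.18. Given a matrix $A$, we will write $D_A = \mathrm{diag}(A\mathbf{1})$. Given
$p = (p_{x_1, \dots, x_n})_{x_1, \dots, x_n} \in \mathcal{L}$, write
$\lambda(p) = \sum_{x_i \in B} p_{x_1, \dots, x_n}$. For example, we can write the
hyperplane $H$ above as $H = \{p \in \mathcal{L} : \lambda(p) = 1\}$.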

The following proposition is an adaptation of [AR08, Proposition 1] to our 
models. 

Proposition 2.19. Let $\mathcal{M} = (W_0, \mathrm{Mod})$ be a root-independent evolutionary
model and let $T$ be a trivalent $n$-leaf rooted tree on $W$ evolving under $\mathcal{M}$.
Fix an edge path $\gamma$ in $T$ with initial vertex the root of $T$ and such that it passes
through all vertices in the tree. Then there is a non-empty open set
$U_\gamma \subset \mathrm{Par}_{\mathcal{M}}(T)$ such that, if $p = \Psi_T^{\mathcal{M}}(\pi, A)$
with $(\pi, A) \in U_\gamma$,

(i) there is a presentation $(\tilde{\pi}, \tilde{A}) \in \mathrm{Par}_{\mathcal{M}}(T)$ with
$\tilde{A}^e$ stochastic for every edge $e$ in the tree and
$p = \Psi_T^{\mathcal{M}}(\tilde{\pi}, \tilde{A})$;

(ii) there exists $q \in \mathrm{Im}\,\phi_T^{\mathcal{M}}$ such that $p = \lambda(p) q$.

Figure 1.


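Proof. Given $A \in \mathrm{Mod}$ and $\pi \in W_0$, write $\tilde{\pi} = A^t \pi$. Since the
model is root-independent, the matrix $\tilde{A} = D_{\tilde{\pi}}^{-1} A^t D_\pi$ is still in
$\mathrm{Mod}$ provided that all entries in $\tilde{\pi}$ are different from 0. In this case
notice that $\tilde{A}$ is stochastic:

$$\tilde{A}\mathbf{1} = (D_{\tilde{\pi}}^{-1} A^t D_\pi)\mathbf{1} = D_{\tilde{\pi}}^{-1}(A^t \pi) = \mathbf{1}.$$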
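The computation above can be checked numerically. The sketch below builds $\tilde{\pi} = A^t \pi$ and $\tilde{A} = D_{\tilde{\pi}}^{-1} A^t D_\pi$ for an arbitrary (non-stochastic) matrix and verifies with exact rational arithmetic that every row of $\tilde{A}$ sums to one. The function name `move_root` and the sample numbers are hypothetical, and the example takes $\mathrm{Mod}$ to be all $4 \times 4$ matrices (the GMM case), so membership in $\mathrm{Mod}$ is automatic here.

```python
from fractions import Fraction as F

def move_root(pi, A):
    """Return (pi_new, A_new) with pi_new = A^t pi and
    A_new = D_{pi_new}^{-1} A^t D_pi, as in the displayed computation."""
    k = len(pi)
    pi_new = [sum(A[j][i] * pi[j] for j in range(k)) for i in range(k)]
    if any(x == 0 for x in pi_new):
        raise ValueError("A^t pi has a zero entry, so D^{-1} does not exist")
    A_new = [[A[j][i] * pi[j] / pi_new[i] for j in range(k)] for i in range(k)]
    return pi_new, A_new

# Arbitrary sample data (not from the paper); A is deliberately not stochastic.
pi = [F(1, 2), F(1, 4), F(1, 8), F(1, 8)]
A = [[2, 1, 0, 1],
     [1, 3, 1, 0],
     [0, 1, 2, 2],
     [1, 0, 1, 5]]

pi_new, A_new = move_root(pi, A)
row_sums = [sum(row) for row in A_new]
print("row sums of the new matrix:", row_sums)

# The weight pi_j * A_{j,i} attached to the edge is unchanged after re-rooting.
edge_weight_preserved = all(
    pi[j] * A[j][i] == pi_new[i] * A_new[i][j]
    for i in range(4) for j in range(4)
)
print("edge weights preserved:", edge_weight_preserved)
```

The second check reflects why moving the root does not change the image of the parametrization: the product $\pi_j A_{j,i}$ attached to the reversed edge is rewritten as $\tilde{\pi}_i \tilde{A}_{i,j}$.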

The idea for the proof is to move the root from one vertex of the tree to another
according to the edge path $\gamma$, replacing at each step some matrix $A^e$ by a new
matrix $\tilde{A}^e$ which is stochastic. In order to construct this new matrix, some
conditions must be imposed on the original presentation $(\pi, A)$.

Write $\gamma = (e_1, e_2, \dots, e_m)$ for the ordered collection of edges in $\gamma$ so
that $e_1$ contains the root and the terminal vertex of $e_{j-1}$ equals the initial vertex
of $e_j$. Notice that the edge path may go twice through the same edge and, in this case,
this edge will appear in the collection above with opposite orientation.

The root of $T$ and each edge $e_j$ provide a number of conditions on
$\mathrm{Par}_{\mathcal{M}}(T)$ as claimed, namely:

Root $r$: $\pi$ has all its entries different from 0.

Edge $e_1$: $\pi_1 := (A^{e_1})^t \pi$ has all its entries different from 0.
Then, write $\tilde{A}^{e_1} = D_{\pi_1}^{-1} (A^{e_1})^t D_\pi$.

Edge $e_i$, $2 \le i \le m$: $\pi_i := (A^{e_i})^t \pi_{i-1}$ has all its entries different
from 0. Then, write $\tilde{A}^{e_i} = D_{\pi_i}^{-1} (A^{e_i})^t D_{\pi_{i-1}}$.

Define $U_\gamma$ as the set in $\mathrm{Par}_{\mathcal{M}}(T)$ defined by all these
conditions. It is a nonempty Zariski open set of $\mathrm{Par}_{\mathcal{M}}(T)$. Moreover,
for any $(\pi, A) \in U_\gamma$, the inverse of $D_\pi$ and the consecutive inverses of
$D_{\pi_i}$ are guaranteed to exist. Therefore, we apply Lemma 2.11 to move the root
through the path $\gamma$ and obtain a new presentation $(\tilde{\pi}, \tilde{A})$ in
$\mathcal{M}$ satisfying the conditions of (i).

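We can assume therefore that $p = \Psi_T^{\mathcal{M}}(\pi, A)$ with $A^e$ stochastic for every
edge in the tree. Since $\pi \in W_0$, we know that $\lambda := \mathbf{1}^t \pi \neq 0$.
Define a new distribution for the root by $\tilde{\pi} = \frac{1}{\lambda}\pi \in W_0$ and
write $\tilde{A} = A$. We have that
$(\tilde{\pi}, \tilde{A}) \in \mathrm{Par}_{s\mathcal{M}}(T)$ and we define
$q = \phi_T^{\mathcal{M}}(\tilde{\pi}, \tilde{A})$. We have to show that $p = \lambda q$ and
$\lambda = \lambda(p)$.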

Write $q_{x_1, \dots, x_n}$ for the coordinates of $q$ in the basis $\mathcal{B}$. Similarly,
write $p_{x_1, \dots, x_n}$ for the coordinates of $p$. Identifying the vectors
$x = \sum_{j=1}^{k} c_j b_j \in W$ with their coordinates in the basis $B$,
$(c_1, c_2, \dots, c_k)$, we write $x^t M$ to mean $(c_1, c_2, \dots, c_k) M$.
Applying (1), we obtain that

$$p_{x_1 \dots x_n} = \sum_{x_v \in B,\ v \in \mathrm{Int}(T)} \pi_{x_r} \prod_{e \in E(T)} x_{e_0}^t A^{e} x_{e_1}
= \lambda \sum_{x_v \in B,\ v \in \mathrm{Int}(T)} \tilde{\pi}_{x_r} \prod_{e \in E(T)} x_{e_0}^t \tilde{A}^{e} x_{e_1}
= \lambda\, q_{x_1, \dots, x_n},$$

the second equality by the definition of $\tilde{\pi}$. Moreover, since
$\sum q_{x_1, \dots, x_n} = 1$, we infer that $\lambda = \sum_{x_1 \dots x_n} p_{x_1 \dots x_n}$,
that is, $\lambda = \lambda(p)$. □

Given a set $Z \subset \mathcal{L}$, denote by $\mathcal{I}(Z)$ the ideal of polynomials in
$\mathbb{C}[\mathcal{L}] := \mathbb{C}[p_{x_1, \dots, x_n}]$ that vanish over $Z$. From the
proposition above we derive the following corollary, stated for future reference, which
relates the different phylogenetic varieties defined above.

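Corollary 2.20. Let $\mathcal{M} = (W_0, \mathrm{Mod})$ be a root-independent evolutionary model
and let $T$ be a trivalent $n$-leaf tree on $W$ evolving under $\mathcal{M}$. Then,

(a) $CV_T^{\mathcal{M}}$ equals the affine cone over the projective phylogenetic variety
$\mathbb{P}V_T^{\mathcal{M}}$;

(b) $\mathcal{I}(\mathrm{Im}\,\Psi_T^{\mathcal{M}}) + (h) = \mathcal{I}(\mathrm{Im}\,\phi_T^{\mathcal{M}})$,
where $h = \sum p_{x_1, \dots, x_n} - 1$;

(c) $V_T^{\mathcal{M}} = CV_T^{\mathcal{M}} \cap H$.

Consequence (a) was proved by Allman and Rhodes for the general Markov
model (see [AR08, Proposition 1]).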

Proof. (a) Since $\mathrm{Im}\,\Psi_T^{\mathcal{M}}$ is a cone, the ideal of polynomials
vanishing on it has to be homogeneous. Now, by virtue of Proposition 2.19 we know that a
homogeneous polynomial vanishes on $\mathrm{Im}\,\Psi_T^{\mathcal{M}}$ if and only if it vanishes
on $\mathrm{Im}\,\phi_T^{\mathcal{M}}$. It follows immediately that $CV_T^{\mathcal{M}}$, which is
the variety defined by all these polynomials in $\mathcal{L}$, equals the affine cone over
$\mathbb{P}V_T^{\mathcal{M}}$.

(b) One inclusion is easy. To prove the other inclusion, let
$F \in \mathcal{I}(\mathrm{Im}\,\phi_T^{\mathcal{M}})$, so that $F$ vanishes on
$V_T^{\mathcal{M}}$. If $d$ is the degree of $F$, write $F = F_m + F_{m+1} + \dots + F_d$,
where every $F_j$ is a homogeneous polynomial of degree $j \le d$. Let
$p \in \mathrm{Im}(\Psi_T^{\mathcal{M}})$ and, using Proposition 2.19, write $p = \lambda(p) q$,
where $q \in \mathrm{Im}(\phi_T^{\mathcal{M}})$. Then, we have

$$0 = F(q) = \frac{F_m(p)}{\lambda(p)^m} + \frac{F_{m+1}(p)}{\lambda(p)^{m+1}} + \dots + \frac{F_d(p)}{\lambda(p)^d}.$$

On the other hand, the equation of $H$ is $h = 0$ where $h(p) = \lambda(p) - 1$. That is,
$\lambda(p) = h(p) + 1$. Replacing this in the above equation and multiplying by
$\lambda(p)^d$, we obtain that the polynomial

$$\tilde{F}(p) = F_m(p)(h(p) + 1)^{d-m} + F_{m+1}(p)(h(p) + 1)^{d-m-1} + \dots + F_d(p)$$

is identically zero on $\mathrm{Im}\,\Psi_T^{\mathcal{M}}$, that is,
$\tilde{F} \in \mathcal{I}(\mathrm{Im}\,\Psi_T^{\mathcal{M}})$. This polynomial has the
form $\tilde{F} = F + hQ$ for some polynomial $Q \in \mathbb{C}[\mathcal{L}]$. From this, we
immediately have $F = \tilde{F} - hQ \in \mathcal{I}(\mathrm{Im}\,\Psi_T^{\mathcal{M}}) + (h)$
and we are done.

(c) follows directly by taking the affine varieties defined by the ideals in (b). □

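The previous Corollary implies that
$\dim CV_T^{\mathcal{M}} = \dim \mathbb{P}V_T^{\mathcal{M}} + 1$, and if
$p = (p_{A \dots A}, \dots, p_{T \dots T})$ belongs to $CV_T^{\mathcal{M}}$, then
$q := [p_{A \dots A} : \dots : p_{T \dots T}]$ belongs to $\mathbb{P}V_T^{\mathcal{M}}$.
Moreover, if $\lambda := \sum p_{x_1 \dots x_n} \neq 0$, then
$q = [\frac{p_{A \dots A}}{\lambda} : \dots : \frac{p_{T \dots T}}{\lambda}]$ and
$(\frac{p_{A \dots A}}{\lambda}, \dots, \frac{p_{T \dots T}}{\lambda})$ is a point in the
affine stochastic phylogenetic variety $V_T^{\mathcal{M}}$.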

3. The space of phylogenetic mixtures 

In phylogenetics, the hypothesis that the sites of an alignment are independent
and identically distributed is often used in the simplest models.
When one removes the assumption "identically distributed" and replaces it by
"distributed according to the same evolutionary model" then one obtains a
phylogenetic mixture. Here we introduce phylogenetic mixtures from the algebraic
point of view.

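Definition 3.1. Fix a set of taxa $[n]$ and an algebraic evolutionary model $\mathcal{M}$. A
phylogenetic mixture (on $m$-classes) or $m$-mixture is any vector
$p \in \mathcal{L} = \otimes_{[n]} W$ of the form

$$p = \sum_{i=1}^{m} \alpha_i p^i$$

where $p^i \in \mathrm{Im}(\Psi_{T_i}^{\mathcal{M}})$, $T_i \in \mathcal{T}_n$ and
$\alpha_i \in \mathbb{C}$. As $\Psi_{T_i}^{\mathcal{M}}$ is a homogeneous map, phylogenetic
mixtures are actually vectors of the form $\sum_i p^i$, where
$p^i \in \mathrm{Im}(\Psi_{T_i}^{\mathcal{M}})$.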

Note that in a phylogenetic mixture we allow some (or all) tree topologies
$T_i$ to be the same. Therefore, the widely used discrete Gamma-rates or any type
of rate variability across sites are instances of phylogenetic mixtures (we refer to
the book [SS03] for an introduction to these concepts).

We denote by $\mathcal{D}_{\mathcal{M}} \subset \mathcal{L}$ the set of all phylogenetic
mixtures (on any number of classes) under the algebraic evolutionary model $\mathcal{M}$ and
by $\mathcal{D}_{\mathcal{M}}^m$ the set of all phylogenetic mixtures on $m$-classes.



12 MARTA CASANELLAS, JESUS FERNANDEZ-SANCHEZ, AND ANNA KEDZIERSKA 

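When we restrict to matrices whose rows sum to one so that we consider
the parameterization $\phi_T^{\mathcal{M}}$, one has to restrict the phylogenetic mixtures to
points of the form

$$q = \sum_{i=1}^{m} \alpha_i q^i \quad \text{where } q^i \in \mathrm{Im}(\phi_{T_i}^{\mathcal{M}}) \text{ and } \sum_i \alpha_i = 1.$$

We call $\mathcal{D}_{s\mathcal{M}}$ the space of these stochastic phylogenetic mixtures.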

The following result was proven by Matsen, Mossel and Steel in [MMS08] 
for the two state random cluster model. 

Lemma 3.2. Given a set of taxa $[n]$ and an algebraic evolutionary model $\mathcal{M}$,
the set of all phylogenetic mixtures $\mathcal{D}_{\mathcal{M}}$ is a vector subspace of
$\mathcal{L}$. Similarly, the space $\mathcal{D}_{s\mathcal{M}}$ is a linear variety of the
affine space $\mathcal{L}$ contained in the hyperplane $H$.

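Proof. $\mathcal{D}_{\mathcal{M}}$ is a $\mathbb{C}$-vector space by definition.

In order to prove that $\mathcal{D}_{s\mathcal{M}}$ is a linear variety, let $q_0$ be any point
in $\mathcal{D}_{s\mathcal{M}}$, so that $q_0 = \sum_{i=1}^{m} \alpha_i q^i$ with
$q^i \in \mathrm{Im}(\phi_{T_i}^{\mathcal{M}})$, $i = 1, \dots, m$, and $\sum_i \alpha_i = 1$.
Then we can write

$$\mathcal{D}_{s\mathcal{M}} = q_0 + F, \quad \text{where } F = \{\overrightarrow{q_0 q} \mid q \in \mathcal{D}_{s\mathcal{M}}\}.$$

We only have to show that $F$ is a $\mathbb{C}$-vector space:

1) Let $v = \overrightarrow{q_0 q}$ be a vector in $F$; then
$\lambda v = \overrightarrow{q_0 q'}$ where $q' = q_0 + \lambda \overrightarrow{q_0 q}$.
This last point is in $\mathcal{D}_{s\mathcal{M}}$: if $q = \sum_j \beta_j \hat{q}^j$ with
$\sum_j \beta_j = 1$, then
$q' = (1 - \lambda) \sum_i \alpha_i q^i + \lambda \sum_j \beta_j \hat{q}^j$ and the scalar
coefficients sum to one:
$(1 - \lambda) \sum_i \alpha_i + \lambda \sum_j \beta_j = (1 - \lambda) + \lambda = 1$.
Therefore $\lambda v$ is in $F$.

2) Let $v_1 = \overrightarrow{q_0 q}$ and $v_2 = \overrightarrow{q_0 \bar{q}}$ be two vectors
in $F$, with

$$q = \sum_j \beta_j \hat{q}^j \ \text{ with } \sum_j \beta_j = 1, \qquad \bar{q} = \sum_k \gamma_k \bar{q}^k \ \text{ with } \sum_k \gamma_k = 1;$$

then $v_1 + v_2 = \overrightarrow{q_0 q'}$ with
$q' = \sum_j \beta_j \hat{q}^j + \sum_k \gamma_k \bar{q}^k - \sum_i \alpha_i q^i$, and all
coefficients together sum to one: $\sum_j \beta_j + \sum_k \gamma_k - \sum_i \alpha_i = 1$.

□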

Remark 3.3. By virtue of the previous lemma, $\mathcal{D}_{\mathcal{M}}$ is an algebraic variety
that contains $\mathrm{Im}\,\Psi_T^{\mathcal{M}}$ for any tree $T$ and therefore, it also
contains $CV_T^{\mathcal{M}}$. It follows that $\mathcal{D}_{\mathcal{M}}$ equals the set of
points of the form $p = \sum_i p^i$ where $p^i \in CV_{T_i}^{\mathcal{M}}$. Similarly,
$\mathcal{D}_{s\mathcal{M}}$ equals the set of points of the form $q = \sum_i \alpha_i q^i$,
where $q^i \in V_{T_i}^{\mathcal{M}}$ and $\sum_i \alpha_i = 1$.

For technical reasons needed in the next result, we introduce the following 
spaces: 

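Definition 3.4. Define $\overline{\mathcal{D}}_{\mathcal{M}}^{m}$ as the set of points $p$ of
the form $p = \sum_{i=1}^{m} p^i$ where $p^i \in CV_{T_i}^{\mathcal{M}}$, and
$\overline{\mathcal{D}}_{s\mathcal{M}}^{m}$ as the set of points $q$ of the form
$q = \sum_{i=1}^{m} \alpha_i q^i$ where $q^i \in V_{T_i}^{\mathcal{M}}$ and
$\sum_i \alpha_i = 1$.

Lemma 3.5. The following equalities hold

(a) $\overline{\mathcal{D}}_{s\mathcal{M}}^{m} = \overline{\mathcal{D}}_{\mathcal{M}}^{m} \cap H$,

(b) $\mathcal{D}_{s\mathcal{M}} = \mathcal{D}_{\mathcal{M}} \cap H$.

Proof. (a) Let $q \in \overline{\mathcal{D}}_{s\mathcal{M}}^{m}$. Then, we can write
$q = \sum_{i=1}^{m} \alpha_i q^i$ for some $q^i \in V_{T_i}^{\mathcal{M}}$ and
$\sum_i \alpha_i = 1$. Clearly, $q \in \overline{\mathcal{D}}_{\mathcal{M}}^{m}$. Moreover,
$\lambda(q) = \sum_i \alpha_i \lambda(q^i) = \sum_i \alpha_i = 1$. Thus, $q \in H$.

Conversely, let $p = \sum_i p^i$ with $p^i \in CV_{T_i}^{\mathcal{M}}$ for certain tree
topologies $T_i$, and assume that $\lambda(p) = 1$. Apply Proposition 2.19 to each $p^i$ to
get $p^i = \lambda(p^i) q^i$ for some $q^i \in V_{T_i}^{\mathcal{M}}$. Then, we have

$$p = \sum_i p^i = \sum_i \lambda(p^i) q^i$$

and $1 = \lambda(p) = \sum_i \lambda(p^i) \lambda(q^i) = \sum_i \lambda(p^i)$ since each $q^i$
lies on $H$. This proves that $p \in \overline{\mathcal{D}}_{s\mathcal{M}}^{m}$.

(b) can be proven using (a) and Remark 3.3. □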

4. The space of phylogenetic mixtures for equivariant evolutionary models

This section will be devoted to giving a precise description of the space
$\mathcal{D}_{\mathcal{M}}$ for the equivariant models $\mathcal{M}$ listed in 2.5. Thus, we
will assume that $B = \{A, C, G, T\}$, $k = 4$ and $W = \langle B \rangle_{\mathbb{C}}$. From
now on, $n$ is fixed and $\mathcal{L} = \otimes^n W$.
Let $G \le \mathfrak{S}_4$ be a permutation group. We consider the restriction to $G$ of
the defining representation

$$(2) \qquad \rho : \mathfrak{S}_4 \longrightarrow GL(W)$$

given by the permutation of the elements of $B$. This representation induces a
$G$-module structure on $W$ by setting

$$g \cdot x := \rho(g)(x) \in W.$$

In fact, $\rho$ induces a $G$-module structure on $\mathcal{L} = \otimes^n W$ by setting

$$(3) \qquad g \cdot (x_1 \otimes \dots \otimes x_n) := g \cdot x_1 \otimes \dots \otimes g \cdot x_n$$

and extending by linearity. According to Notation 2.9, if $X \in \mathcal{B}$ and $g \in G$,
$gX$ will stand for the action of $g$ on $X$ as introduced above. From now on, the space
$\mathcal{L}$ will be implicitly considered as a $G$-module with this action.

Definition 4.1. Given a set of taxa $[n]$, a $G$-tensor on $[n]$ is an $[n]$-tensor
invariant by the action defined in (3). The set of $G$-tensors will be denoted by
$\mathcal{L}^G$.

The following Theorem describes the set of phylogenetic mixtures for equivariant
models in an easy way.

Theorem 4.2. If $\mathcal{M}$ is one of the equivariant evolutionary models JC69, K80,
K81, SSM or GMM, then the space of phylogenetic mixtures $\mathcal{D}_{\mathcal{M}}$ coincides
with $\mathcal{L}^{G_{\mathcal{M}}}$ and
$\mathcal{D}_{s\mathcal{M}} = \mathcal{L}^{G_{\mathcal{M}}} \cap H$.

This theorem allows one to identify the set of all phylogenetic mixtures
$\mathcal{D}_{\mathcal{M}}$ with $\mathcal{L}^{G_{\mathcal{M}}}$, which is a vector subspace of
$\mathcal{L}$ whose linear equations are easy to describe, as we will see afterwards in this
section. In other words, $\mathcal{L}^{G_{\mathcal{M}}}$ is the space where data coming from any
mixture of trees evolving under the model $\mathcal{M}$ lies. One can therefore use
$\mathcal{L}^{G_{\mathcal{M}}}$ to select the most suitable model for given data. This
has been studied in the paper [KDGC11] by the first and third author jointly with
M. Drton and R. Guigo.
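To make the statement concrete: membership in the space of $G$-tensors means $p_X = p_{gX}$ for every word $X$ and every $g \in G$, and it suffices to test the generators of $G$. The Python sketch below builds a two-class mixture on a 3-leaf star tree under K81-type matrices and checks that it satisfies the K81 equations but not those of JC69. It is only an illustration under stated assumptions: the helper names, the numerical parameters, the mixing weights and the generating set chosen for $\mathfrak{S}_4$ are hypothetical, and this is not the selection procedure of [KDGC11].

```python
from itertools import product

STATES = "ACGT"

def perm(*cycles):
    """Permutation of STATES, as a dict, built from cycles such as "ACGT"."""
    g = {s: s for s in STATES}
    for c in cycles:
        for i, s in enumerate(c):
            g[s] = c[(i + 1) % len(c)]
    return g

K81_GENS = [perm("AC", "GT"), perm("AG", "CT")]   # generators from Definition 2.5
JC69_GENS = [perm("ACGT"), perm("AC")]            # one generating set of S_4

def k81_matrix(b, c, d):
    """K81-type stochastic matrix: b on AC|GT, c on AG|CT, d on AT|CG."""
    kind = {frozenset("AC"): b, frozenset("GT"): b,
            frozenset("AG"): c, frozenset("CT"): c,
            frozenset("AT"): d, frozenset("CG"): d}
    return {(x, y): (1 - b - c - d if x == y else kind[frozenset((x, y))])
            for x in STATES for y in STATES}

def star_tree_distribution(pi, matrices):
    """Pattern distribution of a star tree with a single hidden root state."""
    p = {}
    for pattern in product(STATES, repeat=len(matrices)):
        total = 0.0
        for r in STATES:
            term = pi[r]
            for M, x in zip(matrices, pattern):
                term *= M[(r, x)]
            total += term
        p[pattern] = total
    return p

def satisfies_linear_equations(p, generators, tol=1e-12):
    """Check p_X = p_{gX} for every pattern X and every generator g."""
    return all(abs(p[X] - p[tuple(g[x] for x in X)]) <= tol
               for g in generators for X in p)

uniform = {s: 0.25 for s in STATES}
p1 = star_tree_distribution(uniform, [k81_matrix(0.10, 0.05, 0.02),
                                      k81_matrix(0.03, 0.20, 0.01),
                                      k81_matrix(0.04, 0.02, 0.15)])
p2 = star_tree_distribution(uniform, [k81_matrix(0.01, 0.02, 0.03)] * 3)
mixture = {X: 0.3 * p1[X] + 0.7 * p2[X] for X in p1}

print("K81 equations hold: ", satisfies_linear_equations(mixture, K81_GENS))
print("JC69 equations hold:", satisfies_linear_equations(mixture, JC69_GENS))
```

With real alignments the pattern frequencies are only estimates, so an exact test such as `satisfies_linear_equations` would have to be replaced by a statistical criterion; the sketch only shows which linear equations are involved.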

Proof of Theorem 4.2. In Lemma 3.2 we proved that $\mathcal{D}_{\mathcal{M}}$ is a vector
subspace of $\mathcal{L}$. Moreover, as we are considering equivariant models, we have
$\mathrm{Im}(\Psi_T^{\mathcal{M}}) \subset \mathcal{L}^{G_{\mathcal{M}}}$ for any tree $T$ (see
Lemma 4.3 of [DK09]) and hence $\mathcal{D}_{\mathcal{M}}$ is contained in the
vector subspace $\mathcal{L}^{G_{\mathcal{M}}}$.

In order to show that $\mathcal{L}^{G_{\mathcal{M}}} = \mathcal{D}_{\mathcal{M}}$ it remains to
prove that there does not exist any hyperplane $\Pi$ containing $\mathcal{D}_{\mathcal{M}}$ and
not containing $\mathcal{L}^{G_{\mathcal{M}}}$. If such a hyperplane existed, then it would
contain, in particular, all points in $\mathrm{Im}\,\Psi_T^{\mathcal{M}}$ for any tree topology
$T$. As $\Pi$ is an algebraic variety, this implies that $\Pi$ contains
$CV_T^{\mathcal{M}}$ for any tree topology $T$.

It is enough to prove that, for the equivariant models considered here, there
are no homogeneous linear polynomials vanishing on $CV_T^{\mathcal{M}}$ for all tree topologies
$T$, except the linear equations vanishing on $\mathcal{L}^{G_{\mathcal{M}}}$. This is already
known in the literature: for $G$ corresponding to GMM this was proven in Allman-Rhodes [AR04];
for the strand symmetric model this is in [CS05]; for JC69, K80, K81 this appears in
[SS05] (for JC69 and K80 there are other linear relations but they correspond to
phylogenetic invariants, i.e. they are equations that vanish on $CV_T^{\mathcal{M}}$ for a
particular tree topology $T$ but not for all topologies). The main result in [DK09] comprises
all these results.

The equality $\mathcal{D}_{s\mathcal{M}} = \mathcal{L}^{G_{\mathcal{M}}} \cap H$ follows
immediately from Lemma 3.5 and the first assertion in this theorem. □

4.1. Equations for the space $\mathcal{L}^{\mathcal{M}}$. Our purpose now is to compute the
dimension of $\mathcal{L}^{\mathcal{M}} := \mathcal{L}^{G_{\mathcal{M}}}$ where $\mathcal{M}$ is
one of the equivariant models listed in Definition 2.5, as well as to obtain a set of
independent linear equations defining this space. To this aim we need to recall some
definitions and facts in group theory and group representation theory.

Let $G \le \mathfrak{S}_4$ be a permutation group. Given an element $g \in G$, the conjugacy
class of $g$ is $C(g) = \{h^{-1} g h : h \in G\}$. If $g_1, g_2 \in G$, then it is easy to see
that either $C(g_1) = C(g_2)$ or $C(g_1) \cap C(g_2) = \emptyset$. If $C_1, \dots, C_s$ are the
conjugacy classes for $G$, write $C(G) = (|C_1|, \dots, |C_s|)$ for the $s$-tuple of their
cardinalities, so that $\sum_{i=1}^{s} |C_i| = |G|$. Write $\chi^n$ for the character of $G$
associated to the representation $G \to GL(\otimes^n W)$ induced by (3). Recall that
$\chi^n(g_1) = \chi^n(g_2)$ whenever $g_1$ and $g_2$ lie in the same conjugacy class, so that
we represent $\chi^n$ by an $s$-tuple $(t_1, \dots, t_s)$ where $t_i = \chi^n(g)$ for any
$g \in C_i$.

Example 4.3. (cf. [CFS11]) For the equivariant models listed in Definition 2.5,
we have the following table (denote by $e$ the trivial permutation of $\mathfrak{S}_4$):

G ≤ S_4                    M      representatives of conj. classes            C(G)          (t_1,...,t_s)

<(AT)(CG)>                 SSM    {e, (AT)(CG)}                               (1,1)         (4^n, 0)

<(AC)(GT), (AG)(CT)>       K81    {e, (AT)(CG), (AC)(GT), (AG)(CT)}           (1,1,1,1)     (4^n, 0, 0, 0)

<(ACGT), (AG)>             K80    {e, (AC)(GT), (AG)(CT), (ACGT), (AG)}       (1,2,1,2,2)   (4^n, 0, 0, 0, 2^n)

S_4                        JC69   {e, (AC)(GT), (ACGT), (AG), (ACG)}          (1,3,6,6,8)   (4^n, 0, 0, 2^n, 1)

Let $\Omega_G = \{\omega_i\}_{i=1,\dots,t}$ be the set of the irreducible characters of $G$,
where $\omega_1$ stands for the trivial character. Maschke's Theorem applied to the action of
$G$ described in (3) states that there is a decomposition of $\otimes^n W$ into its isotypic
components:

$$(4) \qquad \otimes^n W = \bigoplus_{i=1}^{t} (\otimes^n W)[\omega_i]$$

where each $(\otimes^n W)[\omega_i]$ is isomorphic to a number of copies of the irreducible
representation $N_i$ associated to $\omega_i$,
$(\otimes^n W)[\omega_i] \cong N_i \otimes \mathbb{C}^{m_i(n)}$ for some non-negative
integer $m_i(n)$, called the multiplicity of $\otimes^n W$ relative to $\omega_i$. Moreover,
the set $\Omega_G$ forms an orthonormal basis of the space of characters relative to the inner
product defined by

$$(5) \qquad \langle f, h \rangle := \frac{1}{|G|} \sum_{g \in G} f(g) \overline{h(g)}.$$


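Proposition 4.4. We have

(i) $\dim \mathcal{L}^{\mathrm{SSM}} = 2^{2n-1}$.

(ii) $\dim \mathcal{L}^{\mathrm{K81}} = 4^{n-1}$.

(iii) $\dim \mathcal{L}^{\mathrm{K80}} = 2^{2n-3} + 2^{n-2}$.

(iv) $\dim \mathcal{L}^{\mathrm{JC69}} = \frac{2^{2n-3}+1}{3} + 2^{n-2}$.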

Proof. Let $\mathcal{M}$ be either SSM, K81, K80 or JC69. First of all, notice that the space
of $G$-tensors is just the isotypic component of $\otimes^n W$ associated to the trivial
representation, or equivalently, to the trivial character $\omega_1$:

$$\mathcal{L}^{\mathcal{M}} = (\otimes^n W)[\omega_1].$$

Since the dimension of the trivial representation is one, it follows that the
dimension of $\mathcal{L}^{\mathcal{M}}$ is precisely the multiplicity $m_1(n)$, that is, the
number of times the trivial representation appears in the decomposition of
$\otimes^n W$ into isotypic components. This multiplicity equals

$$m_1(n) = \langle \chi^n, \omega_1 \rangle = \frac{1}{|G|} \sum_{g \in G} \chi^n(g) = \frac{1}{|G|} \sum_{i=1}^{s} |C_i|\, t_i,$$

where in the last equality we group the elements of $G$ according to their conjugacy
class. We apply this formula to the different groups described in Example 4.3
and the result follows. □
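
The multiplicity formula can be checked numerically. The following is a minimal sketch, not part of the original argument: it assumes that $\chi^n(g)$ equals the number of n-words fixed by $g$, i.e. (number of fixed letters)^n, and it generates each group from the generators of Example 4.3 (for JC69 the pair (ACGT), (AC) is our own choice of generators of $\mathfrak{S}_4$). The helper names `perm`, `generate` and `dim_invariants` are hypothetical.

```python
ALPHABET = "ACGT"

def perm(*cycles):
    """Letter permutation given by disjoint cycles, as the tuple of images of A, C, G, T."""
    image = {x: x for x in ALPHABET}
    for cycle in cycles:
        for i, x in enumerate(cycle):
            image[x] = cycle[(i + 1) % len(cycle)]
    return tuple(image[x] for x in ALPHABET)

def generate(generators):
    """All products of the generators (a finite set closed under composition is a group)."""
    identity = tuple(ALPHABET)
    group, frontier = {identity}, [identity]
    while frontier:
        g = frontier.pop()
        for s in generators:
            h = tuple(s[ALPHABET.index(x)] for x in g)  # apply g first, then s
            if h not in group:
                group.add(h)
                frontier.append(h)
    return group

GENERATORS = {
    "SSM": [perm("AT", "CG")],
    "K81": [perm("AC", "GT"), perm("AG", "CT")],
    "K80": [perm("ACGT"), perm("AG")],
    "JC69": [perm("ACGT"), perm("AC")],  # assumption: these two generate all of S_4
}

def dim_invariants(model, n):
    """m_1(n) = (1/|G|) * sum over g in G of (number of letters fixed by g)^n."""
    group = generate(GENERATORS[model])
    total = sum(sum(a == b for a, b in zip(g, ALPHABET)) ** n for g in group)
    return total // len(group)

# Closed forms of Proposition 4.4 (valid as integer expressions for n >= 2).
CLOSED_FORM = {
    "SSM": lambda n: 2 ** (2 * n - 1),
    "K81": lambda n: 4 ** (n - 1),
    "K80": lambda n: 2 ** (2 * n - 3) + 2 ** (n - 2),
    "JC69": lambda n: (2 ** (2 * n - 3) + 1) // 3 + 2 ** (n - 2),
}

for n in range(2, 9):
    for model in GENERATORS:
        assert dim_invariants(model, n) == CLOSED_FORM[model](n)
```

For n = 3 this gives dimensions 32, 16, 10 and 5 for SSM, K81, K80 and JC69 respectively.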

Our next goal is to provide a set of independent linear equations for $\mathcal{L}^{\mathcal{M}}$.
From now on, we will use the notation introduced in 2.9 and we add the following
notation before stating the main result.

Notation 4.5. We will consider the following subsets of $B = \Sigma^n$:

$B_0 = \{(A,\dots,A), (C,\dots,C), (G,\dots,G), (T,\dots,T)\}$

$B_{AC|GT} = \{A,C\}^n \cup \{G,T\}^n$

$B_{AG|CT} = \{A,G\}^n \cup \{C,T\}^n$

$B_{AT|CG} = \{A,T\}^n \cup \{C,G\}^n$

$B_2 = B_{AC|GT} \cup B_{AG|CT} \cup B_{AT|CG}$.

The set $B_0$ is composed of all n-words with only one letter and it is contained in
$B_{AC|GT}$, $B_{AG|CT}$ and $B_{AT|CG}$. Similarly, $B_2$ is composed of all n-words with two
letters at most. It is straightforward to check that
$|B_{AC|GT}| = |B_{AG|CT}| = |B_{AT|CG}| = 2^{n+1}$ and $|B_2| = 3 \cdot 2^{n+1} - 8$.
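
These cardinalities are easy to confirm by enumeration for small n. The sketch below is only an illustration; the function names `words` and `special_subsets` are ours.

```python
from itertools import product

def words(letters, n):
    """All n-words over the given letters."""
    return {"".join(w) for w in product(letters, repeat=n)}

def special_subsets(n):
    """The subsets B_0, B_AC|GT, B_AG|CT, B_AT|CG and B_2 of Notation 4.5."""
    b0 = {x * n for x in "ACGT"}
    ac_gt = words("AC", n) | words("GT", n)
    ag_ct = words("AG", n) | words("CT", n)
    at_cg = words("AT", n) | words("CG", n)
    return b0, ac_gt, ag_ct, at_cg, ac_gt | ag_ct | at_cg

for n in range(1, 8):
    b0, ac_gt, ag_ct, at_cg, b2 = special_subsets(n)
    # B_0 is contained in each of the three two-block sets.
    assert b0 <= ac_gt and b0 <= ag_ct and b0 <= at_cg
    assert len(ac_gt) == len(ag_ct) == len(at_cg) == 2 ** (n + 1)
    # B_2 is exactly the set of words using at most two letters.
    assert b2 == {w for w in words("ACGT", n) if len(set(w)) <= 2}
    assert len(b2) == 3 * 2 ** (n + 1) - 8
```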

We will adopt multiplicative notation for the n-words in B. For instance, we
will write $C^l$ to mean the word $C \dots C$ (with $l$ letters) and
$(A^l)(G^m) x_{l+m+1} \dots x_n$ to mean $A \dots A\, G \dots G\, x_{l+m+1} \dots x_n$ (with $l$
letters A followed by $m$ letters G), where $x_{l+m+1}, \dots, x_n$ represent any
possible choice of letters.

The main result of this section is the following:

Theorem 4.6. A set of linearly independent equations for $\mathcal{L}^{\mathcal{M}}$ is given by

$E^{\mathrm{SSM}}$: $p_X = p_{(AT)(CG)X}$, where X has $x_1 \in \{A, C\}$;

$E^{\mathrm{K81}}$: the equations in $E^{\mathrm{SSM}}$, together with

      $p_X = p_{(AC)(GT)X}$,

   where X has $x_1 = A$;

$E^{\mathrm{K80}}$: the equations in $E^{\mathrm{K81}}$, together with

      $p_X = p_{(AG)X}$,

   where $X \in B \setminus B_{AG|CT}$ has $x_1 = A$, and if T appears in X, there is some C
   in a preceding position;

$E^{\mathrm{JC69}}$: the equations in $E^{\mathrm{K80}}$, together with

      $p_X = p_{(AT)X}$,

   where $X \in B_{AC|GT} \setminus B_0$ has the form $(A^l)(C^m) x_{l+m+1} \dots x_n$; and equations

      $p_X = p_{(AC)X}$ and $p_X = p_{(AT)X}$,

   where $X \in B \setminus B_2$ has the form $(A^l)(C^m) x_{l+m+1} \dots x_n$ and, if T appears
   in X, there is some G in a preceding position.

The number of equations added in each case is:

SSM: $2^{2n-1}$;
K81: $2^{2n-2}$;
K80: $2^{2n-3} - 2^{n-2}$; and
JC69: $2^{n-1} - 1 + 2\left(\frac{2^{2n-3}+1}{3} - 2^{n-2}\right)$.
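
The selection rules of the theorem can be turned into a short enumeration. The sketch below is our own illustration under the reading of the theorem given above (in particular, the K80 rule is taken with $X \notin B_{AG|CT}$, as in the proof). Each equation $p_X = p_Y$ is stored as a pair (X, Y); since all equations are of this form, they are linearly independent exactly when the graph with these pairs as edges has no cycle, which `is_forest` tests with a union-find structure. All function names are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"

def act(cycles, word):
    """Apply the letter permutation with the given disjoint cycles to every letter of word."""
    image = {x: x for x in ALPHABET}
    for cycle in cycles:
        for i, x in enumerate(cycle):
            image[x] = cycle[(i + 1) % len(cycle)]
    return "".join(image[x] for x in word)

def first_letters(word):
    """Distinct letters of word in order of first appearance, e.g. 'AACAG' -> 'ACG'."""
    return "".join(dict.fromkeys(word))

def equations(model, n):
    """Pairs (X, Y) standing for p_X = p_Y, selected as in Theorem 4.6."""
    B = ["".join(w) for w in product(ALPHABET, repeat=n)]
    eqs = [(X, act(["AT", "CG"], X)) for X in B if X[0] in "AC"]
    if model == "SSM":
        return eqs
    eqs += [(X, act(["AC", "GT"], X)) for X in B if X[0] == "A"]
    if model == "K81":
        return eqs
    # X starts with A, is not in B_AG|CT, and C is the first of {C, T} to appear.
    eqs += [(X, act(["AG"], X)) for X in B
            if X[0] == "A" and X.replace("A", "").replace("G", "")[:1] == "C"]
    if model == "K80":
        return eqs
    for X in B:
        order = first_letters(X)
        if order == "AC":                   # X in B_AC|GT \ B_0, of the form A^l C^m ...
            eqs.append((X, act(["AT"], X)))
        elif order in ("ACG", "ACGT"):      # X not in B_2, some G precedes any T
            eqs.append((X, act(["AC"], X)))
            eqs.append((X, act(["AT"], X)))
    return eqs

def is_forest(eqs):
    """True if no equation follows from the earlier ones (the graph of pairs is acyclic)."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            x = parent[x]
        return x
    for X, Y in eqs:
        rx, ry = find(X), find(Y)
        if rx == ry:
            return False
        parent[rx] = ry
    return True

EXPECTED_DIM = {
    "SSM": lambda n: 2 ** (2 * n - 1),
    "K81": lambda n: 4 ** (n - 1),
    "K80": lambda n: 2 ** (2 * n - 3) + 2 ** (n - 2),
    "JC69": lambda n: (2 ** (2 * n - 3) + 1) // 3 + 2 ** (n - 2),
}

for n in (2, 3, 4):
    for model in EXPECTED_DIM:
        eqs = equations(model, n)
        assert is_forest(eqs)                                # linear independence
        assert 4 ** n - len(eqs) == EXPECTED_DIM[model](n)   # codimension matches Prop. 4.4
```

For n = 3 the four systems contain 32, 48, 54 and 59 equations, in agreement with the counts stated in the theorem.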

In order to prove this theorem we need a few technical results.

Lemma 4.7. If $G = \langle g_1, \dots, g_t \rangle$, then $\mathcal{L}^{G} = \bigcap_{i=1}^{t} \mathcal{L}^{\langle g_i \rangle}$.

Proof. One inclusion is straightforward. We prove the other. Let
$p \in \bigcap_{i=1}^{t} \mathcal{L}^{\langle g_i \rangle}$, so we have $g_i p = p$ for any $i$ (and in
particular, $g_i^{-1} p = p$). Any element of $G$ can be written as
$g = g_{i_1}^{m_1} \cdots g_{i_k}^{m_k}$ with $m_j \ne 0$. The invariance of $p$ under all the
$g_i$ and $g_i^{-1}$ completes the proof. □

As a consequence of this lemma, we obtain that a system of linear equations
for $\mathcal{L}^{\mathcal{M}}$ is obtained from a system of generators of $G$: given a point
$p \in \mathcal{L}$, we have that

$$p \in \mathcal{L}^{\mathcal{M}} \iff p_{gX} = p_X, \quad \forall g \in G,\ \forall X \in B.$$

If $H$ is a subgroup of $G$, we take $H \backslash G = \{Hg : g \in G\}$ for the set of right
cosets of $H$ in $G$, $Hg = \{hg : h \in H\}$. By Lagrange's theorem, we know that
$|H \backslash G| = |G| / |H|$. Moreover, if $[G : H]$ is the index of $H$ in $G$ and
$\{g_1, \dots, g_{[G:H]}\}$ is a transversal of $H \backslash G$, we have a partition of $G$

(6)   $$G = \bigsqcup_{i=1}^{[G:H]} H g_i.$$

The set $H \backslash G$ can be understood as a single $G$-orbit with the natural action of
$G$ on it.

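Example 4.8. For the models of Example 4.3, we have

(i) $[G_{\mathrm{SSM}} : \langle e \rangle] = 2$; a transversal of $\langle e \rangle \backslash G_{\mathrm{SSM}}$ is {e, (AT)(CG)}.

(ii) $[G_{\mathrm{K81}} : G_{\mathrm{SSM}}] = 2$; a transversal of $G_{\mathrm{SSM}} \backslash G_{\mathrm{K81}}$ is {e, (AC)(GT)}.

(iii) $[G_{\mathrm{K80}} : G_{\mathrm{K81}}] = 2$; a transversal of $G_{\mathrm{K81}} \backslash G_{\mathrm{K80}}$ is {e, (AG)}.

(iv) $[G_{\mathrm{JC69}} : G_{\mathrm{K80}}] = 3$; a transversal of $G_{\mathrm{K80}} \backslash G_{\mathrm{JC69}}$ is {e, (AC), (AT)}.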
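
The indices and transversals of this example can be verified mechanically. The following sketch is only illustrative: permutations are stored as the tuple of images of A, C, G, T, the product hg means "apply g first, then h", and the generating sets (in particular (ACGT), (AC) for the full symmetric group) are our own choice.

```python
ALPHABET = "ACGT"

def perm(*cycles):
    """Letter permutation given by disjoint cycles, as the tuple of images of A, C, G, T."""
    image = {x: x for x in ALPHABET}
    for cycle in cycles:
        for i, x in enumerate(cycle):
            image[x] = cycle[(i + 1) % len(cycle)]
    return tuple(image[x] for x in ALPHABET)

def compose(g, h):
    """The permutation 'apply h first, then g'."""
    return tuple(g[ALPHABET.index(x)] for x in h)

def generate(generators):
    """The group generated by the given permutations."""
    identity = perm()
    group, frontier = {identity}, [identity]
    while frontier:
        g = frontier.pop()
        for s in generators:
            h = compose(s, g)
            if h not in group:
                group.add(h)
                frontier.append(h)
    return group

def is_transversal(H, G, reps):
    """Check that the right cosets Hg, g in reps, are pairwise distinct and cover G."""
    cosets = [frozenset(compose(h, g) for h in H) for g in reps]
    return H <= G and len(set(cosets)) == len(reps) and set().union(*cosets) == G

e = perm()
TRIVIAL = {e}
SSM = generate([perm("AT", "CG")])
K81 = generate([perm("AC", "GT"), perm("AG", "CT")])
K80 = generate([perm("ACGT"), perm("AG")])
JC69 = generate([perm("ACGT"), perm("AC")])

CHAIN = [
    (TRIVIAL, SSM, [e, perm("AT", "CG")]),
    (SSM, K81, [e, perm("AC", "GT")]),
    (K81, K80, [e, perm("AG")]),
    (K80, JC69, [e, perm("AC"), perm("AT")]),
]
for H, G, reps in CHAIN:
    assert len(G) // len(H) == len(reps)   # the index [G : H]
    assert is_transversal(H, G, reps)
```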

Notation. We write $\{X\}_G$ (or even $\{X\}_{\mathcal{M}_G}$) for the orbit of $X \in B$ under
the action of $G$: $\{X\}_G = \{gX : g \in G\}$.

Lemma 4.9. Let $g_1, \dots, g_m$ be a transversal of $H \backslash G$. For every $X \in B$, we have

$$\{X\}_G = \bigcup_{i=1,\dots,m} \{g_i X\}_H.$$

Proof. Apply the decomposition (6) to the element X. □

Lemma 4.10. Let $X \in B$. Then,

SSM: $\{X\}_{\mathrm{SSM}} = \{X, (AT)(CG)X\}$ and there are $2^{2n-1}$ different orbits.

K81: $\{X\}_{\mathrm{K81}} = \{X\}_{\mathrm{SSM}} \cup \{(AC)(GT)X\}_{\mathrm{SSM}}$ has cardinality 4 and there are
     $2^{2n-2}$ different orbits.

K80: ∘ If $X \in B_{AG|CT}$, then $\{X\}_{\mathrm{K80}} = \{X\}_{\mathrm{K81}}$ has cardinality 4 and there are
       $2^{n-1}$ different orbits;
     ∘ if $X \in B \setminus B_{AG|CT}$, then $\{X\}_{\mathrm{K80}} = \{X\}_{\mathrm{K81}} \cup \{(AG)X\}_{\mathrm{K81}}$ has
       cardinality 8 and there are $2^{2n-3} - 2^{n-2}$ different orbits.

JC69: ∘ If $X \in B_0$, then $\{X\}_{\mathrm{JC69}} = \{X\}_{\mathrm{K80}}$ has cardinality 4 and there is only
        one orbit;
      ∘ if $X \in B_{AC|GT} \setminus B_0$, then $\{X\}_{\mathrm{JC69}} = \{X\}_{\mathrm{K80}} \cup \{(AT)X\}_{\mathrm{K80}}$ has
        cardinality 12 and there are $2^{n-1} - 1$ different orbits; moreover, the
        union of such orbits covers the whole $B_2 \setminus B_0$;
      ∘ if $X \in B \setminus B_2$, then
        $\{X\}_{\mathrm{JC69}} = \{X\}_{\mathrm{K80}} \cup \{(AC)X\}_{\mathrm{K80}} \cup \{(AT)X\}_{\mathrm{K80}}$ has cardinality 24
        and there are $\frac{1}{3}(2^{2n-3} + 1) - 2^{n-2}$ different orbits.

We can summarize this result in the following table:

             {X}_GMM   {X}_SSM             {X}_K81                 {X}_K80             {X}_JC69

  B_0        {X}       ... ∪ {(AT)(CG)X}   ... ∪ {(AC)(GT)X}_SSM   ...                 ...
  B_AG|CT    ''        ''                  ''                      ...                 ... ∪ {(AC)X}_K80
  B_AC|GT    ''        ''                  ''                      ... ∪ {(AG)X}_K81   ... ∪ {(AT)X}_K80
  B_AT|CG    ''        ''                  ''                      ... ∪ {(AG)X}_K81   ... ∪ {(AC)X}_K80
  B \ B_2    ''        ''                  ''                      ... ∪ {(AG)X}_K81   ... ∪ {(AC)X}_K80 ∪ {(AT)X}_K80

where ... means the set on the left and '' means the set on the top.



Proof. The idea of the proof is to systematically apply Lemma 4.9 to describe
the orbits of the elements $X \in B$ under the action of the groups considered. SSM
and K81 are straightforward and are left to the reader.

K80: Applying Lemma 4.9, we obtain that

$$\{X\}_{\mathrm{K80}} = \{X\}_{\mathrm{K81}} \cup \{(AG)X\}_{\mathrm{K81}}.$$

If $X \in B_{AG|CT}$, then $\{(AG)X\}_{\mathrm{K81}} = \{X\}_{\mathrm{K81}}$ and $\{X\}_{\mathrm{K80}}$ has cardinality 4. The
number of such orbits is

$$\frac{|B_{AG|CT}|}{4} = 2^{n-1}.$$

If $X \notin B_{AG|CT}$, then $\{(AG)X\}_{\mathrm{K81}} \ne \{X\}_{\mathrm{K81}}$, so $\{X\}_{\mathrm{K80}}$ has cardinality 8. The
number of such orbits is

$$\frac{|B \setminus B_{AG|CT}|}{8} = 2^{2n-3} - 2^{n-2}.$$

JC69: Lemma 4.9 applies to give

$$\{X\}_{\mathrm{JC69}} = \{X\}_{\mathrm{K80}} \cup \{(AC)X\}_{\mathrm{K80}} \cup \{(AT)X\}_{\mathrm{K80}}.$$

(a) If $X \in B_0$, then $\{(AC)X\}_{\mathrm{K80}} = \{(AT)X\}_{\mathrm{K80}} = \{X\}_{\mathrm{K80}}$, so $\{X\}_{\mathrm{JC69}}$ has 4
    elements. The number of such orbits is

    $$|B_0| / 4 = 1.$$

(b) If $X \in B_{AC|GT} \setminus B_0$, then $(AT)X \in B_{AG|CT}$ and $\{(AC)X\}_{\mathrm{K80}} = \{X\}_{\mathrm{K80}}$ has
    cardinality 8. Therefore, $\{X\}_{\mathrm{JC69}} = \{(AT)X\}_{\mathrm{K80}} \cup \{X\}_{\mathrm{K80}}$ has
    cardinality 4 + 8 = 12. The number of such orbits is

    $$|B_{AC|GT} \setminus B_0| / 4 = 2^{n-1} - 1.$$

    Moreover, the number of words involved in such orbits is

    $$12\,(2^{n-1} - 1) = 3 \cdot 2^{n+1} - 12,$$

    which is the cardinality of $B_2 \setminus B_0$.

(c) Finally, if $X \notin B_2$, then the three orbits $\{(AC)X\}_{\mathrm{K80}}$, $\{(AT)X\}_{\mathrm{K80}}$ and
    $\{X\}_{\mathrm{K80}}$ have 8 elements each and are disjoint. Thus, we obtain that

    $$\{X\}_{\mathrm{JC69}} = \{X\}_{\mathrm{K80}} \cup \{(AC)X\}_{\mathrm{K80}} \cup \{(AT)X\}_{\mathrm{K80}}$$

    has 24 elements. The number of such orbits is

    $$\frac{|B \setminus B_2|}{24} = \frac{4^n - 3 \cdot 2^{n+1} + 8}{24} = \frac{2^{2n-3} + 1}{3} - 2^{n-2}.$$

This proves the claim. □
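
The orbit sizes and counts of Lemma 4.10 can be confirmed by brute force for small n. The sketch below is an illustration only: it computes orbits by repeatedly applying generators (which suffices for a finite group), with the same assumed generating sets as in Example 4.3 and (ACGT), (AC) for JC69. The names `perm` and `orbit_sizes` are hypothetical.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"

def perm(*cycles):
    """Letter permutation given by disjoint cycles, as a dict letter -> letter."""
    image = {x: x for x in ALPHABET}
    for cycle in cycles:
        for i, x in enumerate(cycle):
            image[x] = cycle[(i + 1) % len(cycle)]
    return image

GENERATORS = {
    "SSM": [perm("AT", "CG")],
    "K81": [perm("AC", "GT"), perm("AG", "CT")],
    "K80": [perm("ACGT"), perm("AG")],
    "JC69": [perm("ACGT"), perm("AC")],  # assumption: these two generate all of S_4
}

def orbit_sizes(model, n):
    """Dictionary {orbit size: number of orbits} for the action on n-words."""
    seen, sizes = set(), Counter()
    for w in product(ALPHABET, repeat=n):
        if w in seen:
            continue
        orbit, frontier = {w}, [w]
        while frontier:
            x = frontier.pop()
            for s in GENERATORS[model]:
                y = tuple(s[c] for c in x)
                if y not in orbit:
                    orbit.add(y)
                    frontier.append(y)
        seen |= orbit
        sizes[len(orbit)] += 1
    return dict(sizes)

def expected(model, n):
    """Orbit counts stated in Lemma 4.10 (sizes with zero orbits are dropped)."""
    table = {
        "SSM": {2: 2 ** (2 * n - 1)},
        "K81": {4: 2 ** (2 * n - 2)},
        "K80": {4: 2 ** (n - 1), 8: 2 ** (2 * n - 3) - 2 ** (n - 2)},
        "JC69": {4: 1, 12: 2 ** (n - 1) - 1,
                 24: (2 ** (2 * n - 3) + 1) // 3 - 2 ** (n - 2)},
    }[model]
    return {size: count for size, count in table.items() if count}

for n in range(2, 6):
    for model in GENERATORS:
        assert orbit_sizes(model, n) == expected(model, n)
```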

Remark 4.11. Notice that given a subgroup $G$ of $\mathfrak{S}_4$, every orbit
$o = \{X_1, \dots, X_m\}$ described above provides a $G$-tensor (a tensor invariant under
the action of $G$) defined by

$$\Sigma(o) = \sum_{i=1}^{m} X_i.$$

All these tensors are linearly independent, since each orbit involves different
vectors of $B$. It follows that all together they provide a basis for $\mathcal{L}^{G}$.

Now, we proceed to prove Theorem 4.6. 

Proof of Theorem 4.6. In all these cases, the equations are obtained by taking
the corresponding transversals given by Example 4.8. Assume we have computed
a system of equations for the equivariant model associated with some subgroup
$H \le G$. By virtue of the previous lemmas we only need to care about the
permutations added to $H$ to generate $G$. Lemma 4.9 says that the new $G$-orbits
result from the gluing of some $H$-orbits by the action of the new permutations



THE SPACE OF PHYLOGENETIC MIXTURES FOR EQUIVARIANT MODELS 






added. Therefore, the new equations to add are obtained by taking a transversal
$\{g_1 = e, \dots, g_{[G:H]}\}$ of $H \backslash G$:

$$p_X = p_{g_2 X}, \qquad p_X = p_{g_3 X}, \qquad \dots \qquad \text{for all } X \in B.$$

To avoid repetitions of equations, we have to choose a single element for every
$G$-orbit. Notice that it may happen that for some $X \in B$, $\{g_i X\}_H = \{g_j X\}_H$ for
$i \ne j$. In that case, the equality $p_{g_i X} = p_{g_j X}$ already holds in the space
$\mathcal{L}^{H}$ and does not provide any restriction. We have to take into account this
possibility in order to obtain a minimal set of equations. That they form a minimal
system of equations will follow from their cardinality and the dimension
computation of Proposition 4.4.

SSM: As $G_{\mathrm{SSM}}$ is generated by (AT)(CG), a set of equations defining
     $\mathcal{L}^{\mathrm{SSM}}$ is

     $$\{p_X = p_{(AT)(CG)X} : X \in B\}.$$

     Each SSM-orbit provides a single equation. In order to avoid repetitions
     of equations, we take X with $x_1 \in \{A, C\}$. All together, we obtain $2^{2n-1}$
     equations.

K81: As {e, (AC)(GT)} is a transversal of $G_{\mathrm{SSM}} \backslash G_{\mathrm{K81}}$, we add the equations

     $$\{p_X = p_{(AC)(GT)X} : X \in B\}.$$

     As above, each K81-orbit gives rise to a single equation. To avoid
     repetitions, we restrict to X with $x_1 = A$. Therefore, we are adding $2^{2n-2}$
     equations.

K80: As {e, (AG)} is a transversal of $G_{\mathrm{K81}} \backslash G_{\mathrm{K80}}$, we add the equations

     $$\{p_X = p_{(AG)X} : X \in B\}.$$

     In order to decide whether we are actually adding new equations, we use
     Lemma 4.10. If $X \in B_{AG|CT}$, we know that $\{X\}_{\mathrm{K80}} = \{X\}_{\mathrm{K81}}$ and thus these
     orbits do not give rise to new equations. On the other hand, every orbit
     $\{X\}_{\mathrm{K80}}$ where $X \notin B_{AG|CT}$ provides a single equation. To avoid repetitions,
     we take X with $x_1 = A$ and such that, if T appears in X, there is some C in a
     preceding position. Since $X \notin B_{AG|CT}$, the existence and uniqueness of such
     an element in every $G_{\mathrm{K80}}$-orbit is guaranteed. We are adding

     $$(1 - 1) \times 2^{n-1} + (2 - 1) \times (2^{2n-3} - 2^{n-2}) = 2^{2n-3} - 2^{n-2}$$

     new equations.

JC69: As {e, (AC), (AT)} is a transversal of $G_{\mathrm{K80}} \backslash G_{\mathrm{JC69}}$, we add the equations

     $$\{p_X = p_{(AC)X} : X \in B\} \cup \{p_X = p_{(AT)X} : X \in B\}.$$

     In what follows we use Lemma 4.10 to get rid of redundant equations.
     If $X \in B_0$, then $\{X\}_{\mathrm{K80}} = \{(AC)X\}_{\mathrm{K80}} = \{(AT)X\}_{\mathrm{K80}}$, so we obtain nothing
     new in this case.

     If $X \in B_{AC|GT} \setminus B_0$, we add the equations

     $$p_X = p_{(AT)X}.$$

     To avoid repetitions, we take X of the form $(A^l)(C^m) x_{l+m+1} \dots x_n$, where
     $l, m \ge 1$: we are adding $2^{n-1} - 1$ new equations.

     By Lemma 4.10, if $X \in (B_{AG|CT} \cup B_{AT|CG}) \setminus B_0$, then the corresponding
     JC69-orbit contains elements of $B_{AC|GT}$ and therefore these orbits do not
     provide new equations.

     Finally, if $X \notin B_2$, we add the equations

     $$p_X = p_{(AC)X}, \qquad p_X = p_{(AT)X}.$$

     Each orbit provides a couple of equations. To avoid repetitions, we
     choose X of the form $(A^l)(C^m) x_{l+m+1} \dots x_n$ (where $l, m \ge 1$) and such
     that if T appears in X, there is some G in a preceding position. The
     number of such equations is

     $$(3 - 1) \times \left(\frac{2^{2n-3}+1}{3} - 2^{n-2}\right) = \frac{2^{2n-2}+2}{3} - 2^{n-1}.$$

□

Remark 4.12. A rather different approach based on representation theory can
be considered to compute the equations for $\mathcal{L}^{\mathcal{M}}$. We explain the idea roughly:
for each group $G$ under consideration, take the decomposition of $\mathcal{L}$ into isotypic
components induced by the defining representation of $G$:

(7)   $$\mathcal{L} = \bigoplus_{i=1}^{t} \mathcal{L}[\omega_i].$$

Then, $\mathcal{L}^{\mathcal{M}}$ is just the component corresponding to the trivial representation,
which maps every element $g \in G$ to the identity map on $\mathcal{L}$. The maps
$\Theta_{\omega_i} = \frac{1}{|G|} \sum_{g \in G} \omega_i(g)\, g$ define projections

$$\Theta_{\omega_i} : \mathcal{L} \to \mathcal{L}[\omega_i].$$

Since the isotypic components are orthogonal, we can proceed to systematically
apply these projections to obtain bases for the non-trivial isotypic components.
The inner product $\langle \cdot, \cdot \rangle$ defined in (5) can then be used to infer a minimal
system of equations defining $\mathcal{L}^{\mathcal{M}}$ from these bases.

Remark 4.13. The sets of equations provided in Theorem 4.6 have been
successfully used in the paper [KDGC11] for model selection. Although the
dimensions of these linear spaces are exponential in n, for biological applications
one does not need to consider all the equations but only those containing the
patterns observed in the data (in real applications the number of different columns
in an alignment is really small compared to the dimension of these spaces). The
algorithm SPIn, implemented for the purpose of the above paper, selects the
equations in Theorem 4.6 in this way. Moreover, as the equations provided in this
theorem are binomials, these equations are useful for obtaining the maximum
likelihood estimate and applying Akaike's information criterion (see [KDGC11] for
details).
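
To illustrate why only the observed patterns matter, here is a minimal sketch for the JC69 case. It is our own illustration and not the SPIn algorithm of [KDGC11]: since a point of $\mathcal{L}^{\mathrm{JC69}}$ takes the same value on all patterns of an orbit, observed column counts can be pooled by orbit, and each orbit has a unique word whose letters first appear in the order A, C, G, T. The function names and the toy list of columns are hypothetical.

```python
from collections import Counter

ALPHABET = "ACGT"

def canonical_jc69(pattern):
    """Relabel letters in order of first appearance: the unique word of the orbit under
    all letter permutations whose letters first appear in the order A, C, G, T."""
    relabel = {}
    for x in pattern:
        relabel.setdefault(x, ALPHABET[len(relabel)])
    return "".join(relabel[x] for x in pattern)

def pooled_counts(columns):
    """Number of observed columns falling in each JC69-orbit, keyed by its canonical word."""
    return Counter(canonical_jc69(c) for c in columns)

# Toy alignment columns on 3 taxa (made-up data, for illustration only).
columns = ["ACG", "TCG", "CAG", "AAA", "TTT", "GGC"]
counts = pooled_counts(columns)
assert counts == Counter({"ACG": 3, "AAA": 2, "AAC": 1})
```

Only the orbits that actually occur in the data appear as keys, so the work is proportional to the number of observed columns rather than to $4^n$.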

Example 4.14. As an example, we compute a minimal system of equations for
SSM, K81, K80 and JC69 in the case of 3 leaves.

Equations for $\mathcal{L}^{\mathrm{SSM}}$: $E^{\mathrm{SSM}}$ is composed of the following equations:

  p_AAA = p_TTT    p_AAC = p_TTG    p_AAG = p_TTC    p_AAT = p_TTA
  p_ACA = p_TGT    p_ACC = p_TGG    p_ACG = p_TGC    p_ACT = p_TGA
  p_AGA = p_TCT    p_AGC = p_TCG    p_AGG = p_TCC    p_AGT = p_TCA
  p_ATA = p_TAT    p_ATC = p_TAG    p_ATG = p_TAC    p_ATT = p_TAA
  p_CAA = p_GTT    p_CAC = p_GTG    p_CAG = p_GTC    p_CAT = p_GTA
  p_CCA = p_GGT    p_CCC = p_GGG    p_CCG = p_GGC    p_CCT = p_GGA
  p_CGA = p_GCT    p_CGC = p_GCG    p_CGG = p_GCC    p_CGT = p_GCA
  p_CTA = p_GAT    p_CTC = p_GAG    p_CTG = p_GAC    p_CTT = p_GAA

Equations for $\mathcal{L}^{\mathrm{K81}}$: $E^{\mathrm{K81}}$ is composed of $E^{\mathrm{SSM}}$ together with

  p_AAA = p_CCC    p_AAC = p_CCA    p_AAG = p_CCT    p_AAT = p_CCG
  p_ACA = p_CAC    p_ACC = p_CAA    p_ACG = p_CAT    p_ACT = p_CAG
  p_AGA = p_CTC    p_AGC = p_CTA    p_AGG = p_CTT    p_AGT = p_CTG
  p_ATA = p_CGC    p_ATC = p_CGA    p_ATG = p_CGT    p_ATT = p_CGG

Equations for $\mathcal{L}^{\mathrm{K80}}$: $E^{\mathrm{K80}}$ is composed of $E^{\mathrm{K81}}$ together with

  p_AAC = p_GGC    p_ACG = p_GCA    p_ACT = p_GCT
  p_ACA = p_GCG    p_AGC = p_GAC    p_ACC = p_GCC

Equations for $\mathcal{L}^{\mathrm{JC69}}$: $E^{\mathrm{JC69}}$ is composed of $E^{\mathrm{K80}}$ together with

  p_AAC = p_TTC    p_ACA = p_TCT    p_ACC = p_TCC    p_ACG = p_CAG    p_ACG = p_TCG.

5. Identifiability of phylogenetic mixtures



Definition 5.1. Given two projective varieties $X, Y \subset \mathbb{P}^m$, the join of $X$ and $Y$,
$X \vee Y$, is the smallest variety in $\mathbb{P}^m$ containing all lines $\overline{xy}$ with $x \in X$, $y \in Y$
and $x \ne y$ (see [Har92, 8.1] for the details of this definition). Similarly, we can
define the join of projective varieties $X_1, \dots, X_h \subset \mathbb{P}^m$, $\vee_{i=1}^{h} X_i$, as the smallest



24 MARTA CASANELLAS, JESUS FERNANDEZ-SANCHEZ, AND ANNA KEDZIERSKA 

subvariety in $\mathbb{P}^m$ containing all the linear varieties spanned by $x_1, \dots, x_h$ with
$x_i \in X_i$ and $x_i \ne x_j$. It is known that

$$\dim\left(\vee_{i=1}^{h} X_i\right) \le \min\left\{ \sum_{i=1}^{h} \dim(X_i) + h - 1,\ m \right\}.$$

The right hand side of this inequality is usually known as the expected
dimension of $\vee_{i=1}^{h} X_i$.

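For example, if we consider the join $\vee_{i=1}^{h} \mathbb{P}V_{T_i}$ for certain tree topologies $T_i$
on the leaf set [n] and a given evolutionary model $\mathcal{M}$, then there is a dominant
rational map

$$\mathbb{P}V_{T_1} \times \dots \times \mathbb{P}V_{T_h} \times \mathbb{P}^{h-1} \dashrightarrow \vee_{i=1}^{h} \mathbb{P}V_{T_i} \subset \mathbb{P}(\mathcal{L}),$$

corresponding to the projective closure of the parameterization
$\phi_{T_1} \vee \dots \vee \phi_{T_h}$ defined by

$$\mathrm{Pars}_{\mathcal{M}}(T_1) \times \dots \times \mathrm{Pars}_{\mathcal{M}}(T_h) \times \Omega \to \mathcal{L},$$

where $\Omega = \{a = (a_1, \dots, a_h) \mid \sum_i a_i = 1\}$ is isomorphic to an affine open subset
of $\mathbb{P}^{h-1}$.

In this setting, an h-mixture on $\{T_1, \dots, T_h\}$ corresponds to a point in the
variety $\vee_{i=1}^{h} \mathbb{P}V_{T_i}$. We will use this algebraic variety to study the identifiability
of phylogenetic mixtures.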

We recall the definition of generic identifiability of the tree topologies on
h-mixtures (see for example [APRS10]).

Definition 5.2. The tree topologies on h-mixtures over $\mathcal{M}$ are generically
identifiable if for any set of trivalent tree topologies $\{T_1, \dots, T_h\}$ and generic
choice of $(\xi_1, \dots, \xi_h, a) \in \mathrm{Pars}_{\mathcal{M}}(T_1) \times \dots \times \mathrm{Pars}_{\mathcal{M}}(T_h) \times \Omega$, the equality

$$\phi_{T_1} \vee \dots \vee \phi_{T_h}(\xi_1, \dots, \xi_h, a) = \phi_{T'_1} \vee \dots \vee \phi_{T'_h}(\xi'_1, \dots, \xi'_h, a'),$$

for tree topologies $\{T'_1, \dots, T'_h\}$ and stochastic parameters $(\xi'_1, \dots, \xi'_h, a')$, implies

$$\{T_1, \dots, T_h\} = \{T'_1, \dots, T'_h\}.$$

In terms of algebraic varieties this is equivalent to saying that the variety
$\vee_{i=1}^{h} \mathbb{P}V_{T_i}$ is not contained in $\vee_{i=1}^{h} \mathbb{P}V_{T'_i}$ and vice versa.

The tree topologies are the discrete parameters of h-mixtures. When we
come to the continuous parameters we have the following definition.

Definition 5.3. The continuous parameters on h-mixtures on $T_1, \dots, T_h$ under
an evolutionary model $\mathcal{M}$ are generically identifiable if for generic choices of
stochastic parameters $(\xi_1, \dots, \xi_h, a)$, the equality

$$\phi_{T_1} \vee \dots \vee \phi_{T_h}(\xi_1, \dots, \xi_h, a) = \phi_{T_1} \vee \dots \vee \phi_{T_h}(\xi'_1, \dots, \xi'_h, a')$$

for stochastic parameters $(\xi'_1, \dots, \xi'_h, a')$ implies
$(\xi_1, \dots, \xi_h, a) = (\xi'_1, \dots, \xi'_h, a')$ or an allowed permutation of the parameters
(see [APRS10, Definition 2]).

In terms of algebraic varieties, generic identifiability of continuous
parameters implies that the generic fibers of the map $\phi_{T_1} \vee \dots \vee \phi_{T_h}$ are
finite. In particular, the fiber dimension theorem applies (cf. [Har92, Theorem
11.12]) to obtain

$$\dim\left(\vee_{i=1}^{h} \mathbb{P}V_{T_i}\right) = \sum_{i=1}^{h} \dim\left(\mathbb{P}V_{T_i}\right) + h - 1.$$

The converse of this result (that is, finite generic fibers of $\phi_{T_1} \vee \dots \vee \phi_{T_h}$ imply
generic identifiability) is not necessarily true because a finite fiber can be formed
by more than one stochastically meaningful point.

Example 5.4. The tree topologies and the continuous parameters are generically
identifiable for the unmixed equivariant models JC69, K80, K81, SSM, GMM (see
[CFS11, Corollary 3.9]).

If the continuous parameters are generically identifiable under an evolutionary
model $\mathcal{M}$, then the dimension of the variety $\mathbb{P}V_T$ is the same for all trivalent
tree topologies on n taxa and corresponds to the number of free parameters of
the stochastic model (fiber dimension theorem, cf. [Har92, Theorem 11.12]). Let
$d_{\mathcal{M}}$ be this dimension; then we have the following result.

Theorem 5.5. Let $\mathcal{M}$ be an evolutionary model for which continuous parameters
are generically identifiable on trivalent trees and let
$h_0 := \frac{\dim \mathcal{D}_{\mathcal{M}}}{d_{\mathcal{M}} + 1}$, where $d_{\mathcal{M}}$ is the dimension of $\mathbb{P}V_T$ as above. Then
either the continuous parameters or the tree parameters are not generically
identifiable for h-mixtures under the model $\mathcal{M}$ if $h \ge h_0$.

Remark 5.6. This theorem proves that it makes no sense to do phylogenetic
inference for h-mixtures when $h \ge h_0$.

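Corollary 5.7. Let [n] be a set of taxa and $\mathcal{M}$ be one of the equivariant models
JC69, K80, K81, SSM, GMM. Then phylogenetic h-mixtures under these models are
not identifiable for $h \ge h_0$, where

  • $h_0 = \frac{4^n}{12(2n-3)+4}$ if $\mathcal{M} = \mathrm{GMM}$,
  • $h_0 = \frac{2^{2n-1}}{6(2n-3)+2}$ if $\mathcal{M} = \mathrm{SSM}$,
  • $h_0 = \frac{4^{n-1}}{3(2n-3)+1}$ if $\mathcal{M} = \mathrm{K81}$,
  • $h_0 = \frac{2^{2n-3}+2^{n-2}}{2(2n-3)+1}$ if $\mathcal{M} = \mathrm{K80}$,
  • $h_0 = \frac{\frac{1}{3}(2^{2n-3}+1)+2^{n-2}}{2n-2}$ if $\mathcal{M} = \mathrm{JC69}$.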



Proof. Theorem 4.2 shows that $\mathcal{L}^{\mathcal{M}} = \mathcal{D}_{\mathcal{M}}$ and Proposition 4.4 gives the
dimension of this space in each case. Then, we apply Theorem 5.5 taking into account
that $d_{\mathrm{GMM}} = 12(2n-3)+3$, $d_{\mathrm{SSM}} = 6(2n-3)+1$, $d_{\mathrm{K81}} = 3(2n-3)$, $d_{\mathrm{K80}} = 2(2n-3)$
and $d_{\mathrm{JC69}} = 2n-3$. □

Example 5.8. Consider the Kimura 3-parameter model K81 and consider trees
on n = 4 taxa. Then for any $h \ge 4$, phylogenetic h-mixtures are not identifiable
(Corollary 5.7). We are not aware of any result proving that mixtures of
2 or 3 different tree topologies under this model are identifiable (either for tree
parameters or for continuous parameters).

Example 5.9. If we consider the Jukes-Cantor model JC69 on n = 4 taxa, then
Corollary 5.7 tells us that for $h \ge 3$, h-mixtures are not identifiable. Therefore
for this particular model on four taxa the identifiability is solved: the tree and
continuous parameters are generically identifiable for the unmixed model; the tree
parameters are generically identifiable for 2-mixtures [APRS10, Theorem 10]; the
continuous parameters are generically identifiable for 2-mixtures on different tree
topologies and not identifiable for the same tree topology [APRS10, Theorem
23]; either the continuous parameters or the tree topologies are not generically
identifiable for more than two mixtures (Corollary 5.7).

Proof of Theorem 5.5. Let $\mathrm{edim}(h) := h\, d_{\mathcal{M}} + h - 1$. Then the variety
$\vee_{i=1}^{h} \mathbb{P}V_{T_i}$ has dimension $\le \mathrm{edim}(h)$. Indeed, as $\vee_i \phi_{T_i}$ is a parameterization of
an open subset of $\vee_{i=1}^{h} \mathbb{P}V_{T_i}$, the dimension of $\vee_{i=1}^{h} \mathbb{P}V_{T_i}$ is less than or equal to
$\sum_i \dim \mathbb{P}V_{T_i} + h - 1$. Moreover, the dimension of $\mathbb{P}V_{T_i}$ is equal to $d_{\mathcal{M}}$ if $T_i$ is
trivalent (because the continuous parameters for the unmixed models we are
considering are generically identifiable) and is less than $d_{\mathcal{M}}$ for non-trivalent
trees. Therefore $\dim(\vee_{i=1}^{h} \mathbb{P}V_{T_i}) \le \mathrm{edim}(h)$.

If we consider only trivalent trees $T_i$, then $\sum_i \dim \mathbb{P}V_{T_i} + h - 1 = \mathrm{edim}(h)$ and
therefore $\dim(\vee_{i=1}^{h} \mathbb{P}V_{T_i}) < \mathrm{edim}(h)$ if and only if
$\dim(\vee_{i=1}^{h} \mathbb{P}V_{T_i}) < \sum_i \dim \mathbb{P}V_{T_i} + h - 1$. Moreover, by the fiber dimension theorem
applied to $\vee_i \phi_{T_i}$, equality holds if and only if the generic fiber of $\vee_i \phi_{T_i}$ has
dimension 0. In particular, if $\dim(\vee_{i=1}^{h} \mathbb{P}V_{T_i}) < \mathrm{edim}(h)$ then the continuous
parameters of this phylogenetic mixture are not identifiable.

If $h_0 = \frac{\dim \mathcal{D}_{\mathcal{M}}}{d_{\mathcal{M}} + 1}$, then $\mathrm{edim}(h_0) = h_0 (d_{\mathcal{M}} + 1) - 1 = \dim \mathcal{D}_{\mathcal{M}} - 1$. Now we
fix an $h \in \mathbb{N}$ with $h \ge h_0$, so that one has $\mathrm{edim}(h) \ge \dim(\mathcal{D}_{\mathcal{M}}) - 1$.

Two things could happen:

(a) For all tree topologies $\{T_1, \dots, T_h\}$ one has
    $\dim(\vee_{i=1}^{h} \mathbb{P}V_{T_i}) < \dim(\mathcal{D}_{\mathcal{M}}) - 1$.

(b) There exists a set of tree topologies $\{T_1, \dots, T_h\}$ for which
    $\dim(\vee_{i=1}^{h} \mathbb{P}V_{T_i}) = \dim(\mathcal{D}_{\mathcal{M}}) - 1$.

Case (a) implies that for any set of trivalent tree topologies $\{T_1, \dots, T_h\}$
one has $\dim(\vee_{i=1}^{h} \mathbb{P}V_{T_i}) < \mathrm{edim}(h)$. And we have seen above that this implies
that the continuous parameters are not generically identifiable.

In case (b) one has that $\vee_{i=1}^{h} \mathbb{P}V_{T_i} = \mathbb{P}(\mathcal{D}_{\mathcal{M}})$. Indeed,
$\vee_{i=1}^{h} \mathbb{P}V_{T_i} \subseteq \mathbb{P}(\mathcal{D}_{\mathcal{M}})$ and
$\dim(\vee_{i=1}^{h} \mathbb{P}V_{T_i}) = \dim(\mathcal{D}_{\mathcal{M}}) - 1 = \dim(\mathbb{P}(\mathcal{D}_{\mathcal{M}}))$, which implies that both
varieties coincide (the proper subvarieties of an affine space have dimension strictly
smaller than it). In particular any other h-mixture (which is a point in
$\mathbb{P}(\mathcal{D}_{\mathcal{M}})$) would be contained in $\vee_{i=1}^{h} \mathbb{P}V_{T_i}$ and therefore the topologies are not
generically identifiable.
□
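
The thresholds of Corollary 5.7 are simple to tabulate. The following sketch rests on our reading of the text: $h_0 = \dim \mathcal{D}_{\mathcal{M}} / (d_{\mathcal{M}} + 1)$, the dimensions of Proposition 4.4 (with $4^n$ assumed for GMM), the values of $d_{\mathcal{M}}$ quoted in the proof of the corollary, and non-identifiability for every integer $h \ge h_0$, as Examples 5.8 and 5.9 suggest. The dictionary and function names are hypothetical.

```python
from fractions import Fraction
from math import ceil

# dim D_M as a function of the number of taxa n (assumption: 4^n for GMM).
DIM = {
    "GMM": lambda n: 4 ** n,
    "SSM": lambda n: 2 ** (2 * n - 1),
    "K81": lambda n: 4 ** (n - 1),
    "K80": lambda n: 2 ** (2 * n - 3) + 2 ** (n - 2),
    "JC69": lambda n: (2 ** (2 * n - 3) + 1) // 3 + 2 ** (n - 2),
}

# d_M: dimension of the variety of a trivalent tree, which has 2n - 3 edges.
D_M = {
    "GMM": lambda n: 12 * (2 * n - 3) + 3,
    "SSM": lambda n: 6 * (2 * n - 3) + 1,
    "K81": lambda n: 3 * (2 * n - 3),
    "K80": lambda n: 2 * (2 * n - 3),
    "JC69": lambda n: 2 * n - 3,
}

def h0(model, n):
    """Exact threshold h_0 = dim D_M / (d_M + 1)."""
    return Fraction(DIM[model](n), D_M[model](n) + 1)

def smallest_non_identifiable(model, n):
    """Smallest integer number of mixture components h with h >= h_0."""
    return ceil(h0(model, n))

# The values used in Examples 5.8 and 5.9 (n = 4 taxa).
assert h0("K81", 4) == 4
assert h0("JC69", 4) == Fraction(5, 2)
assert smallest_non_identifiable("JC69", 4) == 3
```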

Remark 5.10. The negative result of Theorem 5.5 should be complemented with
the following positive result of Rhodes and Sullivant in [RS]: if $\mathcal{M} = \mathrm{GMM}$ and
one restricts to h-mixtures on the same trivalent tree topology T, then the tree
topology and the continuous parameters are generically identifiable if
$h < 4^{\lceil n/4 \rceil - 1}$.